NVIDIA-472: Add CSR approval for DPU worker nodes #42

tsorya merged 1 commit into rh-ecosystem-edge:main
Conversation
Walkthrough

Adds a CSR auto-approval subsystem: a new CSRApprovalReconciler and CSRApprover with hosted-cluster client caching, CSR content and owner validation, an automated approval flow, tests, RBAC/Helm rule additions, and per-controller event recorders and constants.
Sequence Diagram

```mermaid
sequenceDiagram
    participant Op as Operator
    participant Mgmt as Management\r\nCluster
    participant Hosted as Hosted\r\nCluster
    participant DPU as DPU/Node\r\nStore

    Op->>Mgmt: Reconcile DPFHCPProvisioner
    Mgmt-->>Op: DPFHCPProvisioner + KubeConfigSecretRef
    Op->>Mgmt: Fetch kubeconfig secret
    Mgmt-->>Op: kubeconfig data
    Op->>Hosted: Build/reuse client (ClientManager cache)
    Op->>Hosted: Test connectivity (ServerVersion)
    Hosted-->>Op: reachable / error
    Op->>Hosted: List pending CSRs
    Hosted-->>Op: CSR list
    loop for each pending CSR
        Op->>Op: Extract hostname from CSR
        Op->>Op: Validate CSR content (usages, org, CN/DNS/IP)
        Op->>Mgmt: Check DPU exists
        Mgmt-->>Op: DPU found / not found
        alt DPU found
            Op->>Hosted: Check Node exists
            Hosted-->>Op: Node found / not found
        end
        alt all validations pass
            Op->>Hosted: Approve CSR (UpdateApproval)
            Op->>Mgmt: Record approval event
        else validation fails
            Op->>Mgmt: Record denial/error event
        end
    end
    Op->>Mgmt: Update CSRAutoApprovalActive condition
    Op->>Op: Requeue after 30s
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks: ✅ 3 passed
Force-pushed 1b511d3 to b3b6a0b
Actionable comments posted: 6
🧹 Nitpick comments (3)
internal/controller/csrapproval/csr_owner_validation.go (1)
89-103: Use `Get` by name instead of `List` + linear scan.

`dpuExists` lists all DPUs in the namespace and iterates to find a matching name. A direct `Get` call is more efficient (a single keyed API call vs. listing potentially many objects). The same applies to `nodeExists` below.

♻️ Proposed refactor for `dpuExists`
```diff
 func (v *Validator) dpuExists(ctx context.Context, hostname string) (bool, error) {
-    dpuList := &dpuprovisioningv1alpha1.DPUList{}
-    if err := v.mgmtClient.List(ctx, dpuList, client.InNamespace(v.dpuNamespace)); err != nil {
-        return false, err
-    }
-
-    for _, dpu := range dpuList.Items {
-        if dpu.Name == hostname {
-            return true, nil
-        }
-    }
-
-    return false, nil
+    dpu := &dpuprovisioningv1alpha1.DPU{}
+    err := v.mgmtClient.Get(ctx, client.ObjectKey{Namespace: v.dpuNamespace, Name: hostname}, dpu)
+    if err != nil {
+        if client.IgnoreNotFound(err) == nil {
+            return false, nil
+        }
+        return false, err
+    }
+    return true, nil
 }
```

♻️ Proposed refactor for `nodeExists`
```diff
 func (v *Validator) nodeExists(ctx context.Context, hostname string) (bool, error) {
-    nodeList, err := v.hostedClient.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
-    if err != nil {
-        return false, err
-    }
-
-    for _, node := range nodeList.Items {
-        if node.Name == hostname {
-            return true, nil
-        }
-    }
-
-    return false, nil
+    _, err := v.hostedClient.CoreV1().Nodes().Get(ctx, hostname, metav1.GetOptions{})
+    if err != nil {
+        if errors.IsNotFound(err) {
+            return false, nil
+        }
+        return false, err
+    }
+    return true, nil
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/controller/csrapproval/csr_owner_validation.go` around lines 89 - 103, Replace the List+scan approach in dpuExists and nodeExists with direct Get calls: for dpuExists use v.mgmtClient.Get(ctx, types.NamespacedName{Name: hostname, Namespace: v.dpuNamespace}, &dpuprovisioningv1alpha1.DPU{}) and return true,nil if found, return false,nil on apierrors.IsNotFound(err) and return false,err for other errors; for nodeExists (cluster-scoped) use v.mgmtClient.Get(ctx, types.NamespacedName{Name: hostname}, &corev1.Node{}) with the same NotFound/other-error handling and return true,nil when present. Ensure you import types and apierrors if not already present.

internal/controller/csrapproval/hostname_test.go (1)
297-391: Consider consolidating duplicated CSR helper functions.
`createTestCSR`, `createTestCSRWithOrganization`, and `createServingCSRWithDNS` share the same key generation, CSR creation, PEM encoding, and base64 encoding boilerplate. A single parameterized helper would reduce duplication.

♻️ Proposed consolidated helper
```diff
-func createTestCSR(cn string) []byte {
-    privateKey, err := rsa.GenerateKey(rand.Reader, 2048)
-    if err != nil {
-        panic(err)
-    }
-    template := x509.CertificateRequest{
-        Subject: pkix.Name{
-            CommonName: cn,
-        },
-    }
-    csrBytes, err := x509.CreateCertificateRequest(rand.Reader, &template, privateKey)
-    if err != nil {
-        panic(err)
-    }
-    pemBlock := pem.EncodeToMemory(&pem.Block{
-        Type:  "CERTIFICATE REQUEST",
-        Bytes: csrBytes,
-    })
-    return []byte(base64.StdEncoding.EncodeToString(pemBlock))
-}
-
-func createTestCSRWithOrganization(cn, org string) []byte { ... }
-func createServingCSRWithDNS(cn, org, dnsName string) []byte { ... }
+// buildTestCSR creates a test CSR with optional organization and DNS names.
+func buildTestCSR(cn string, org string, dnsNames ...string) []byte {
+    privateKey, err := rsa.GenerateKey(rand.Reader, 2048)
+    if err != nil {
+        panic(err)
+    }
+    template := x509.CertificateRequest{
+        Subject: pkix.Name{
+            CommonName: cn,
+        },
+        DNSNames: dnsNames,
+    }
+    if org != "" {
+        template.Subject.Organization = []string{org}
+    }
+    csrBytes, err := x509.CreateCertificateRequest(rand.Reader, &template, privateKey)
+    if err != nil {
+        panic(err)
+    }
+    pemBlock := pem.EncodeToMemory(&pem.Block{
+        Type:  "CERTIFICATE REQUEST",
+        Bytes: csrBytes,
+    })
+    return []byte(base64.StdEncoding.EncodeToString(pemBlock))
+}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/controller/csrapproval/hostname_test.go` around lines 297 - 391, The three helpers createTestCSR, createTestCSRWithOrganization, and createServingCSRWithDNS duplicate key generation, CSR creation, PEM encoding, and base64 encoding; replace them with a single parameterized helper (e.g., buildCSR(cn string, orgs []string, dnsNames []string) []byte) that generates the RSA key, constructs an x509.CertificateRequest using Subject.CommonName and Subject.Organization from orgs, sets DNSNames from dnsNames, creates the CSR, PEM-encodes it and returns base64 bytes; update or remove the three original functions to call or forward to buildCSR to eliminate duplication.

internal/controller/csrapproval/client.go (1)
50-68: Cached client is never invalidated on error — stale connections will persist.

When a cached clientset becomes stale (e.g., kubeconfig rotated, API server restarted), `GetHostedClusterClient` keeps returning it without validation. `InvalidateClient` exists, but nothing calls it on connection failures, and `TestConnection` is defined but never invoked from within `GetHostedClusterClient`. Consider invalidating the cache entry when the caller encounters connection errors, or proactively testing the cached client before returning it.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/controller/csrapproval/client.go` around lines 50 - 68, GetHostedClusterClient currently returns cached entries from cm.hcClients without validating them; call the existing TestConnection method on the cached kubernetes.Clientset (keyed by namespace+"/"+name) before returning and if TestConnection returns an error, call InvalidateClient for that key and fall through to createHostedClusterClient to rebuild the client; ensure createHostedClusterClient results are only cached on successful creation and that errors from TestConnection lead to recreating the client instead of returning a stale one.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@internal/controller/csrapproval/client.go`:
- Around line 31-39: ClientManager.hcClients is a plain map accessed
concurrently and can panic; add a sync.RWMutex field (e.g., hcMu sync.RWMutex)
to the ClientManager struct and use it to protect all accesses: use
hcMu.RLock()/RUnlock() around read paths in GetHostedClusterClient and any
callers that only read, and use hcMu.Lock()/Unlock() around write paths where
you create or delete entries (e.g., when inserting a new *kubernetes.Clientset
in GetHostedClusterClient or removing entries in InvalidateClient). Ensure every
access to hcClients (reads and writes) in the methods referenced
(GetHostedClusterClient, InvalidateClient, and any other helpers) is guarded by
the appropriate lock.
- Around line 90-93: The client currently sets config.Timeout = 0 which allows
API calls (used by the CSR list/approve logic) to block indefinitely; change the
timeout to a reasonable value shorter than the 30s polling interval (e.g., set
config.Timeout to 10s or 15s) so list/approve operations in
internal/controller/csrapproval/client.go will fail fast instead of hanging;
update the assignment of config.Timeout (near where config.QPS and config.Burst
are set) to a non-zero duration and ensure any callers expect a timeout error
instead of a hang.
In `@internal/controller/csrapproval/csr_content_validation.go`:
- Around line 257-293: The CSR IP validation is inconsistent:
validateServingIdentities currently only checks nil and IsUnspecified while
hasValidIPAddress (unused) also rejects loopback and multicast; update
validateServingIdentities to call hasValidIPAddress(ip) for each
certReq.IPAddresses and return a clear error (e.g., "CSR contains invalid IP
address: <ip>") when it returns false, or alternatively remove the unused
hasValidIPAddress if you prefer the simpler checks—prefer the former to
centralize IP logic in hasValidIPAddress.
In `@internal/controller/csrapproval/csrapproval.go`:
- Around line 262-295: In setCondition, avoid appending a duplicate when the
condition already exists by introducing a found flag to distinguish "condition
found but unchanged" from "condition not found"; iterate over
dpfhcp.Status.Conditions, set found=true when existing.Type ==
provisioningv1alpha1.CSRAutoApprovalActive, and only replace the element and set
changed=true if existing.Status != status || existing.Reason != reason, then
break; after the loop, if !found append the new condition (and set
changed=true), and finally call a.mgmtClient.Status().Update(ctx, dpfhcp) only
if changed.
- Around line 170-193: The approvedCount is incremented for any nil error from
processCSR, but processCSR currently returns nil for both skipped
(validation-failed) and actually-approved CSRs; change processCSR signature to
return (bool, error) where the bool indicates whether the CSR was approved
(true) or skipped (false), update its return sites to return (false, nil) for
validation skips and (true, nil) for real approvals, then update the callers in
the loops that currently call processCSR (the bootstrap and serving CSR loops)
to unpack (approved, err) and only increment approvedCount when approved ==
true; ensure all usages of processCSR (including any tests) are updated
accordingly.
In `@internal/controller/csrapproval/hostname_test.go`:
- Around line 297-326: The helper functions createTestCSR,
createTestCSRWithOrganization, and createServingCSRWithDNS are double-encoding
the CSR (PEM then base64) but parseCSRRequest/ExtractHostname expect raw PEM
bytes; remove the base64.StdEncoding.EncodeToString(...) call in each helper and
return the PEM bytes directly (e.g., return pemBlock or []byte(pemBlock)) so
tests supply plain PEM instead of base64-encoded PEM.
---
Duplicate comments:
In `@internal/controller/csrapproval/csr_content_validation.go`:
- Around line 211-232: The parseCSRRequest function is incorrectly
base64-decoding csr.Spec.Request (which is already raw []byte); remove the call
to base64.StdEncoding.DecodeString and instead pass csr.Spec.Request directly to
pem.Decode, adjust variable names accordingly (e.g., remove csrBlock), and
delete the now-unused encoding/base64 import; keep the existing error checks and
the call to x509.ParseCertificateRequest on pemBlock.Bytes.
---
Nitpick comments:
In `@internal/controller/csrapproval/client.go`:
- Around line 50-68: GetHostedClusterClient currently returns cached entries
from cm.hcClients without validating them; call the existing TestConnection
method on the cached kubernetes.Clientset (keyed by namespace+"/"+name) before
returning and if TestConnection returns an error, call InvalidateClient for that
key and fall through to createHostedClusterClient to rebuild the client; ensure
createHostedClusterClient results are only cached on successful creation and
that errors from TestConnection lead to recreating the client instead of
returning a stale one.
In `@internal/controller/csrapproval/csr_owner_validation.go`:
- Around line 89-103: Replace the List+scan approach in dpuExists and nodeExists
with direct Get calls: for dpuExists use v.mgmtClient.Get(ctx,
types.NamespacedName{Name: hostname, Namespace: v.dpuNamespace},
&dpuprovisioningv1alpha1.DPU{}) and return true,nil if found, return false,nil
on apierrors.IsNotFound(err) and return false,err for other errors; for
nodeExists (cluster-scoped) use v.mgmtClient.Get(ctx, types.NamespacedName{Name:
hostname}, &corev1.Node{}) with the same NotFound/other-error handling and
return true,nil when present. Ensure you import types and apierrors if not
already present.
In `@internal/controller/csrapproval/hostname_test.go`:
- Around line 297-391: The three helpers createTestCSR,
createTestCSRWithOrganization, and createServingCSRWithDNS duplicate key
generation, CSR creation, PEM encoding, and base64 encoding; replace them with a
single parameterized helper (e.g., buildCSR(cn string, orgs []string, dnsNames
[]string) []byte) that generates the RSA key, constructs an
x509.CertificateRequest using Subject.CommonName and Subject.Organization from
orgs, sets DNSNames from dnsNames, creates the CSR, PEM-encodes it and returns
base64 bytes; update or remove the three original functions to call or forward
to buildCSR to eliminate duplication.
Actionable comments posted: 6
🧹 Nitpick comments (5)
internal/controller/csrapproval/csr_owner_validation.go (2)
119-133: The same improvement applies to `nodeExists` — use a direct `Get` instead of listing all nodes.

Listing all nodes in the hosted cluster on every CSR is expensive, especially as clusters grow. A direct `Get` by name is more efficient:

♻️ Proposed refactor
```diff
 func (v *Validator) nodeExists(ctx context.Context, hostname string) (bool, error) {
-    nodeList, err := v.hostedClient.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
-    if err != nil {
-        return false, err
-    }
-
-    for _, node := range nodeList.Items {
-        if node.Name == hostname {
-            return true, nil
-        }
-    }
-
-    return false, nil
+    _, err := v.hostedClient.CoreV1().Nodes().Get(ctx, hostname, metav1.GetOptions{})
+    if err != nil {
+        if apierrors.IsNotFound(err) {
+            return false, nil
+        }
+        return false, err
+    }
+    return true, nil
 }
```

You'll need to import `apierrors "k8s.io/apimachinery/pkg/api/errors"`.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/controller/csrapproval/csr_owner_validation.go` around lines 119 - 133, Replace the expensive list in nodeExists with a direct node Get: call v.hostedClient.CoreV1().Nodes().Get(ctx, hostname, metav1.GetOptions{}) (in function nodeExists) and handle errors using apierrors.IsNotFound(err) to return (false, nil) when the node doesn't exist, otherwise return (false, err) for other errors; add the import apierrors "k8s.io/apimachinery/pkg/api/errors".
104-117: Consider using a direct `Get` instead of listing all DPUs.

`dpuExists` lists every DPU in the namespace and iterates to find a name match. Since you're matching by `dpu.Name == hostname`, a direct `Get` by name would be O(1) vs O(n) and avoids transferring the full list:

♻️ Proposed refactor
```diff
 func (v *Validator) dpuExists(ctx context.Context, hostname string) (bool, error) {
-    dpuList := &dpuprovisioningv1alpha1.DPUList{}
-    if err := v.mgmtClient.List(ctx, dpuList, client.InNamespace(v.dpuNamespace)); err != nil {
-        return false, err
-    }
-
-    for _, dpu := range dpuList.Items {
-        if dpu.Name == hostname {
-            return true, nil
-        }
-    }
-
-    return false, nil
+    dpu := &dpuprovisioningv1alpha1.DPU{}
+    key := client.ObjectKey{Namespace: v.dpuNamespace, Name: hostname}
+    if err := v.mgmtClient.Get(ctx, key, dpu); err != nil {
+        if client.IgnoreNotFound(err) == nil {
+            return false, nil
+        }
+        return false, err
+    }
+    return true, nil
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/controller/csrapproval/csr_owner_validation.go` around lines 104 - 117, Replace the List+loop in Validator.dpuExists with a direct Get by namespaced name: construct a dpuprovisioningv1alpha1.DPU object and call v.mgmtClient.Get(ctx, types.NamespacedName{Name: hostname, Namespace: v.dpuNamespace}, &dpu); if Get returns nil return (true, nil), if it returns a NotFound error (use k8s.io/apimachinery/pkg/api/errors.IsNotFound) return (false, nil), otherwise return (false, err). This avoids iterating over dpuList and keeps the function signature and return semantics the same.

internal/controller/csrapproval/csrapproval.go (1)
260-261: A new `Validator` is instantiated per CSR — consider reusing it across the batch.

`NewValidator` is created inside `processCSR`, which is called for every pending CSR. The `Validator` struct is lightweight, but it triggers list operations (`dpuExists`, `nodeExists`) that query the API for each CSR. Creating the `Validator` once in `processPendingCSRs` and passing it to `processCSR` would allow potential future optimizations (e.g., caching the DPU list once per polling cycle).

♻️ Proposed refactor sketch
```diff
 // In processPendingCSRs, create the validator once:
+validator := NewValidator(a.mgmtClient, hcClient, dpfhcp.Spec.DPUClusterRef.Namespace)

 // Then pass it to processCSR:
-func (a *CSRApprover) processCSR(ctx context.Context, dpfhcp *provisioningv1alpha1.DPFHCPProvisioner, hcClient *kubernetes.Clientset, csr *certv1.CertificateSigningRequest, csrType string) error {
+func (a *CSRApprover) processCSR(ctx context.Context, dpfhcp *provisioningv1alpha1.DPFHCPProvisioner, hcClient *kubernetes.Clientset, csr *certv1.CertificateSigningRequest, csrType string, validator *Validator) error {
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/controller/csrapproval/csrapproval.go` around lines 260 - 261, The Validator is being created per CSR inside processCSR (via NewValidator), causing repeated list queries; instead, create a single Validator in processPendingCSRs and pass it into processCSR as a parameter (e.g., add a *Validator arg to processCSR and remove the NewValidator call inside it). Update all callers of processCSR accordingly, and ensure the existing methods on Validator (dpuExists, nodeExists) are used from the shared instance so future caching or batched list optimizations can be added centrally.

internal/controller/csrapproval/csr_content_validation.go (1)
162-208: Consider extracting a generic usage validation helper to reduce duplication.

`validateBootstrapUsages` and `validateServingUsages` are nearly identical, differing only in the required usage set. A single `validateRequiredUsages(csr, required)` helper would eliminate the duplication.

♻️ Proposed refactor
```diff
+func validateRequiredUsages(csr *certv1.CertificateSigningRequest, required []certv1.KeyUsage) error {
+    usageSet := make(map[certv1.KeyUsage]bool, len(required))
+    for _, u := range required {
+        usageSet[u] = false
+    }
+    for _, usage := range csr.Spec.Usages {
+        if _, ok := usageSet[usage]; ok {
+            usageSet[usage] = true
+        }
+    }
+    for usage, found := range usageSet {
+        if !found {
+            return fmt.Errorf("missing required usage: %s", usage)
+        }
+    }
+    return nil
+}
+
 func validateBootstrapUsages(csr *certv1.CertificateSigningRequest) error {
-    requiredUsages := map[certv1.KeyUsage]bool{...}
-    ...
+    return validateRequiredUsages(csr, []certv1.KeyUsage{
+        certv1.UsageDigitalSignature,
+        certv1.UsageClientAuth,
+    })
 }

 func validateServingUsages(csr *certv1.CertificateSigningRequest) error {
-    requiredUsages := map[certv1.KeyUsage]bool{...}
-    ...
+    return validateRequiredUsages(csr, []certv1.KeyUsage{
+        certv1.UsageDigitalSignature,
+        certv1.UsageServerAuth,
+    })
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/controller/csrapproval/csr_content_validation.go` around lines 162 - 208, Extract the duplicated logic in validateBootstrapUsages and validateServingUsages into a single helper validateRequiredUsages that takes the CSR and the set/list of required usages (e.g. func validateRequiredUsages(csr *certv1.CertificateSigningRequest, required []certv1.KeyUsage) error or a map), move the loop that marks found usages and the final missing-check into that helper, and then have validateBootstrapUsages and validateServingUsages call validateRequiredUsages with their respective required usage sets (UsageDigitalSignature+UsageClientAuth and UsageDigitalSignature+UsageServerAuth) so behavior and error messages (fmt.Errorf("missing required usage: %s", usage)) remain the same.

internal/controller/csrapproval/controller.go (1)
34-39: `Scheme` and `Recorder` fields are unused in this controller.

`Scheme` and `Recorder` are set during wiring in `main.go` but never referenced inside `Reconcile` or `SetupWithManager`. If they exist only for structural consistency with the main reconciler, consider documenting that intent; otherwise, removing them avoids confusion.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/controller/csrapproval/controller.go` around lines 34 - 39, The CSRApprovalReconciler struct contains unused fields Scheme and Recorder; either remove these fields from CSRApprovalReconciler or explicitly document why they’re retained (e.g., wiring consistency with main.go) and reference their intended use. Locate the CSRApprovalReconciler type definition and delete Scheme *runtime.Scheme and Recorder record.EventRecorder if they are not referenced by Reconcile or SetupWithManager, or add a comment above the struct explaining they are intentionally kept for wiring in main.go and will be used in future logic; update any constructor/wiring code in main.go accordingly to avoid linter confusion.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cmd/main.go`:
- Around line 248-249: The CSR approver is being initialized with
mgr.GetEventRecorderFor("dpfhcpprovisioner-controller") and mgr.GetClient(),
which misattributes events to the main controller and is inconsistent with the
reconciler; change the NewCSRApprover call that creates csrApprover to use the
same local client variable (client) and use
mgr.GetEventRecorderFor("csrapproval-controller") so events are attributed
correctly and the client reference matches the reconciler (update the call to
csrapproval.NewCSRApprover(...)).
In `@internal/controller/csrapproval/client.go`:
- Around line 203-210: TestConnection currently calls
clientset.Discovery().ServerVersion() which blocks and ignores ctx (and relies
on config.Timeout being non-zero); update TestConnection to respect the provided
context by using a context-aware discovery REST call (e.g. use
clientset.Discovery().RESTClient().Get().AbsPath("/version").Do(ctx).Into(&versionInfo)
or the equivalent context-taking client-go call) or, if you cannot change the
callsite, ensure the REST config used to build clientset has a non-zero Timeout
(config.Timeout > 0) so the call is bounded; modify the TestConnection function
(and related client creation) to use the context-aware request path or guarantee
config.Timeout is set to a sensible non-zero duration when constructing the
kubernetes.Clientset.
In `@internal/controller/csrapproval/controller.go`:
- Around line 65-71: Add a nil-check for r.Approver before calling its methods
to avoid nil-pointer panics: in the deletion handling and in the code that calls
r.Approver.ProcessCSRs (and the block covering the current ProcessCSRs usage
around lines 87-93), wrap the calls with if r.Approver != nil { ... } else { log
a warning/info } so StopCSRWatch and ProcessCSRs are only invoked when Approver
is non-nil; reference the Approver field and its methods StopCSRWatch and
ProcessCSRs to locate and update the calls.
In `@internal/controller/csrapproval/csr_owner_validation.go`:
- Around line 56-101: The current ValidateCSR implementation rejects bootstrap
CSRs when nodeExists is true; update logic to permit legitimate re-bootstrap
scenarios by inspecting the existing Node's status or age instead of outright
rejecting: change or augment nodeExists to return the Node object or its Ready
condition/timestamps (e.g., modify nodeExists to return (*v1.Node, bool, error)
or add a new getNode helper), then in ValidateCSR when isBootstrapCSR &&
nodeExists check the Node's Ready condition (allow if NotReady) or compare
Node.CreationTimestamp/LastTransitionTime against a configurable gracePeriod to
allow re-bootstrap for older/stale Nodes; keep existing rejection for cases that
indicate a true duplicate join (Node Ready and recent within grace period).
Ensure you reference/adjust ValidateCSR, nodeExists (or new getNode), and any
configurable gracePeriod/dpuNamespace fields accordingly.
In `@internal/controller/csrapproval/csrapproval.go`:
- Around line 74-84: The cached hosted-cluster client can be stale when
TestConnection fails, so on TestConnection error you must invalidate the cached
client and trigger a retry path: call
a.clientManager.InvalidateHostedClusterClient(ctx, dpfhcp.Namespace,
dpfhcp.Name) (or add such a method if missing) when TestConnection returns an
error for the client returned by a.clientManager.GetHostedClusterClient, then
record the condition/event and requeue (or return so the next reconciliation
will create a fresh client); ensure invalidation is performed before returning
from the handler that processes TestConnection failures.
- Around line 86-95: TestConnection failing currently logs and returns without
requeue but leaves the possibly-stale cached hosted-cluster client (hcClient)
intact; update the reconciler so when TestConnection(ctx, hcClient) returns an
error you explicitly invalidate the cached client before returning (e.g., clear
the controller's cached client field or call a new invalidateHostedClusterClient
method), so the next reconciliation will recreate a fresh client; keep the
existing setCondition and Event calls and then return as before.
---
Duplicate comments:
In `@internal/controller/csrapproval/client.go`:
- Around line 31-39: ClientManager's hcClients map is unsynchronized and can
cause concurrent map read/write panics; add a sync.RWMutex field (e.g.,
hcClientsMu) to ClientManager and use it to protect all accesses to hcClients:
acquire hcClientsMu.RLock() for readers and RUnlock(), and
hcClientsMu.Lock()/Unlock() for writers (creation, assignment, deletion). Update
any methods that reference hcClients (including constructors, getters, setters,
and cleanup code) to use the mutex, and ensure the mutex is initialized as part
of the ClientManager struct so the map accesses are always thread-safe.
- Around line 104-107: The client config currently sets config.Timeout = 0 which
allows API calls to hang indefinitely; change it to a finite duration (e.g., 15
seconds) by assigning config.Timeout to a time.Duration value (15 * time.Second)
so reconciler calls won’t stall, and ensure the time package is imported; leave
config.QPS and config.Burst as-is.
In `@internal/controller/csrapproval/csr_content_validation.go`:
- Around line 269-290: The inline IP checks inside validateServingIdentities
should reuse hasValidIPAddress to avoid duplicate logic: replace the current
loop that only checks ip == nil and ip.IsUnspecified() with a call to
hasValidIPAddress(ip) and return an error that reflects the full validation
(e.g., "CSR contains invalid IP address: %s") when it returns false;
alternatively if you prefer the simpler checks, remove the unused
hasValidIPAddress function to eliminate dead code—make sure only one of these
two approaches remains.
In `@internal/controller/csrapproval/csrapproval.go`:
- Around line 205-231: The loop increments approvedCount whenever processCSR
returns nil, but processCSR currently returns nil for both approved and skipped
CSRs, causing over-counting; change processCSR signature to return (bool
approved, error) (or an equivalent enum/status) and update callers in the
bootstrapCSRs and servingCSRs loops to only increment approvedCount when
approved == true and err == nil; also update any other callers of processCSR to
handle the new return value and preserve existing error logging behavior (use
symbols: processCSR, approvedCount, bootstrapCSRs, servingCSRs).
---
Nitpick comments:
In `@internal/controller/csrapproval/controller.go`:
- Around line 34-39: The CSRApprovalReconciler struct contains unused fields
Scheme and Recorder; either remove these fields from CSRApprovalReconciler or
explicitly document why they’re retained (e.g., wiring consistency with main.go)
and reference their intended use. Locate the CSRApprovalReconciler type
definition and delete Scheme *runtime.Scheme and Recorder record.EventRecorder
if they are not referenced by Reconcile or SetupWithManager, or add a comment
above the struct explaining they are intentionally kept for wiring in main.go
and will be used in future logic; update any constructor/wiring code in main.go
accordingly to avoid linter confusion.
In `@internal/controller/csrapproval/csr_content_validation.go`:
- Around line 162-208: Extract the duplicated logic in validateBootstrapUsages
and validateServingUsages into a single helper validateRequiredUsages that takes
the CSR and the set/list of required usages (e.g. func
validateRequiredUsages(csr *certv1.CertificateSigningRequest, required
[]certv1.KeyUsage) error or a map), move the loop that marks found usages and
the final missing-check into that helper, and then have validateBootstrapUsages
and validateServingUsages call validateRequiredUsages with their respective
required usage sets (UsageDigitalSignature+UsageClientAuth and
UsageDigitalSignature+UsageServerAuth) so behavior and error messages
(fmt.Errorf("missing required usage: %s", usage)) remain the same.
In `@internal/controller/csrapproval/csr_owner_validation.go`:
- Around line 119-133: Replace the expensive list in nodeExists with a direct
node Get: call v.hostedClient.CoreV1().Nodes().Get(ctx, hostname,
metav1.GetOptions{}) (in function nodeExists) and handle errors using
apierrors.IsNotFound(err) to return (false, nil) when the node doesn't exist,
otherwise return (false, err) for other errors; add the import apierrors
"k8s.io/apimachinery/pkg/api/errors".
- Around line 104-117: Replace the List+loop in Validator.dpuExists with a
direct Get by namespaced name: construct a dpuprovisioningv1alpha1.DPU object
and call v.mgmtClient.Get(ctx, types.NamespacedName{Name: hostname, Namespace:
v.dpuNamespace}, &dpu); if Get returns nil return (true, nil), if it returns a
NotFound error (use k8s.io/apimachinery/pkg/api/errors.IsNotFound) return
(false, nil), otherwise return (false, err). This avoids iterating over dpuList
and keeps the function signature and return semantics the same.
In `@internal/controller/csrapproval/csrapproval.go`:
- Around line 260-261: The Validator is being created per CSR inside processCSR
(via NewValidator), causing repeated list queries; instead, create a single
Validator in processPendingCSRs and pass it into processCSR as a parameter
(e.g., add a *Validator arg to processCSR and remove the NewValidator call
inside it). Update all callers of processCSR accordingly, and ensure the
existing methods on Validator (dpuExists, nodeExists) are used from the shared
instance so future caching or batched list optimizations can be added centrally.
b3b6a0b to 61aaf40
🧹 Nitpick comments (6)
internal/controller/csrapproval/csr_owner_validation.go (3)
88-100: Drop the unnecessary `else` after early return — idiomatic Go.

The `else` block on line 88 is redundant: the prior `if` already returns, so the `else` clause only adds indentation noise and is non-idiomatic Go.

♻️ Proposed refactor
```diff
-	} else {
-		// Serving CSR: node SHOULD exist (already joined via bootstrap CSR)
-		if !nodeExists {
-			return &ValidationResult{
-				Valid:  false,
-				Reason: fmt.Sprintf("node %s does not exist yet in hosted cluster", hostname),
-			}, nil
-		}
-		return &ValidationResult{
-			Valid:  true,
-			Reason: "DPU exists and node already joined",
-		}, nil
-	}
+	// Serving CSR: node SHOULD exist (already joined via bootstrap CSR)
+	if !nodeExists {
+		return &ValidationResult{
+			Valid:  false,
+			Reason: fmt.Sprintf("node %s does not exist yet in hosted cluster", hostname),
+		}, nil
+	}
+	return &ValidationResult{
+		Valid:  true,
+		Reason: "DPU exists and node already joined",
+	}, nil
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/controller/csrapproval/csr_owner_validation.go` around lines 88 - 100, Remove the unnecessary else block following the early return: in the function that constructs a ValidationResult using variables nodeExists and hostname (the block that currently returns fmt.Sprintf("node %s does not exist yet in hosted cluster", hostname) inside an else), delete the else and outdent its contents so the nodeExists check and the two ValidationResult returns (the "node ... does not exist yet in hosted cluster" and "DPU exists and node already joined") become the natural fall-through after the prior return; keep the same return values and use the same ValidationResult struct.
56-101: No unit tests for `ValidateCSROwner` — security-critical path lacks coverage.

The related files include `csr_content_validation_test.go` and `hostname_test.go`, but no `csr_owner_validation_test.go` appears in this PR. Given that `ValidateCSROwner` is the gatekeeper deciding whether a CSR gets auto-approved, the following cases should be covered at minimum:
- DPU missing → invalid result
- Bootstrap CSR, node absent → valid
- Bootstrap CSR, node present → invalid
- Serving CSR, node present → valid
- Serving CSR, node absent → invalid
- `dpuExists`/`nodeExists` errors propagate correctly

Would you like me to generate a `csr_owner_validation_test.go` covering these cases using the controller-runtime fake client and a fake `kubernetes.Interface`? I can open a new issue to track this as well.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/controller/csrapproval/csr_owner_validation.go` around lines 56 - 101, Add unit tests for ValidateCSROwner covering the six scenarios listed: DPU missing returns Valid=false with the expected Reason; bootstrap CSR with node absent returns Valid=true; bootstrap CSR with node present returns Valid=false; serving CSR with node present returns Valid=true; serving CSR with node absent returns Valid=false; and both dpuExists/nodeExists returning errors propagate the error. Create csr_owner_validation_test.go and use controller-runtime's fake client to control DPU existence and a fake kubernetes.Interface (or a client stub) to control nodeExists behavior; call ValidateCSROwner and assert the ValidationResult.Valid and Reason values for each case and that errors from dpuExists/nodeExists are returned (not swallowed). Ensure tests reference the Validator type and its methods dpuExists and nodeExists (mock/stub them or inject a test-friendly interface) so behavior is deterministic.
104-133: Both helpers do an O(n) full-list scan; prefer a direct point-lookup instead.
`dpuExists` lists every DPU in the namespace and `nodeExists` lists every Node in the hosted cluster, both only to match a single name. A direct `Get` is O(1), avoids transferring the full list over the wire, and is simpler to read. Since DPU names map 1:1 to hostnames, there is no ambiguity in the lookup key.

♻️ Proposed refactor
Add `apierrors "k8s.io/apimachinery/pkg/api/errors"` to the import block, then replace both helpers:

```diff
+import (
+	"context"
+	"fmt"
+
+	apierrors "k8s.io/apimachinery/pkg/api/errors"
+	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
+	"k8s.io/apimachinery/pkg/types"
+	"k8s.io/client-go/kubernetes"
+	"sigs.k8s.io/controller-runtime/pkg/client"
+
+	dpuprovisioningv1alpha1 "github.com/nvidia/doca-platform/api/provisioning/v1alpha1"
+)

 func (v *Validator) dpuExists(ctx context.Context, hostname string) (bool, error) {
-	dpuList := &dpuprovisioningv1alpha1.DPUList{}
-	if err := v.mgmtClient.List(ctx, dpuList, client.InNamespace(v.dpuNamespace)); err != nil {
-		return false, err
-	}
-
-	for _, dpu := range dpuList.Items {
-		if dpu.Name == hostname {
-			return true, nil
-		}
-	}
-
-	return false, nil
+	dpu := &dpuprovisioningv1alpha1.DPU{}
+	err := v.mgmtClient.Get(ctx, types.NamespacedName{Name: hostname, Namespace: v.dpuNamespace}, dpu)
+	if err != nil {
+		if apierrors.IsNotFound(err) {
+			return false, nil
+		}
+		return false, err
+	}
+	return true, nil
 }

 func (v *Validator) nodeExists(ctx context.Context, hostname string) (bool, error) {
-	nodeList, err := v.hostedClient.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
+	_, err := v.hostedClient.CoreV1().Nodes().Get(ctx, hostname, metav1.GetOptions{})
 	if err != nil {
-		return false, err
-	}
-
-	for _, node := range nodeList.Items {
-		if node.Name == hostname {
-			return true, nil
+		if apierrors.IsNotFound(err) {
+			return false, nil
 		}
+		return false, err
 	}
-
-	return false, nil
+	return true, nil
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/controller/csrapproval/csr_owner_validation.go` around lines 104 - 133, Replace the full-list scans in dpuExists and nodeExists with direct Get calls: import k8s apierrors and call v.mgmtClient.Get(ctx, client.ObjectKey{Namespace: v.dpuNamespace, Name: hostname}, &dpuprovisioningv1alpha1.DPU{}) in dpuExists and use v.hostedClient.CoreV1().Nodes().Get(ctx, hostname, metav1.GetOptions{}) in nodeExists; treat apierrors.IsNotFound as a false/no-error result and return other errors upward so you avoid O(n) list operations and return correct boolean+error behavior for both helpers.

internal/controller/csrapproval/client.go (1)
159-191: `replaceServerWithInternalEndpoint` has two unused parameters and a package-name shadow.

- `ctx context.Context` and `mgmtClient client.Client` are accepted as parameters but never referenced in the function body — they are dead parameters.
- `context, ok := kubeconfig.Contexts[currentContext]` (Line 170) declares a local variable named `context` that shadows the imported `"context"` standard library package within this function's scope. Rename it to `kubeconfigCtx` or `ctxEntry` to avoid future confusion.

♻️ Proposed cleanup
```diff
-func replaceServerWithInternalEndpoint(ctx context.Context, mgmtClient client.Client, kubeconfig *clientcmdapi.Config, hostedClusterNamespace, hostedClusterName string) error {
+func replaceServerWithInternalEndpoint(kubeconfig *clientcmdapi.Config, hostedClusterNamespace, hostedClusterName string) error {
 	if kubeconfig == nil {
 		return fmt.Errorf("kubeconfig is nil")
 	}
 	currentContext := kubeconfig.CurrentContext
 	if currentContext == "" {
 		return fmt.Errorf("kubeconfig has no current context")
 	}
-	context, ok := kubeconfig.Contexts[currentContext]
+	kubeconfigCtx, ok := kubeconfig.Contexts[currentContext]
 	if !ok {
 		return fmt.Errorf("context %s not found in kubeconfig", currentContext)
 	}
-	clusterName := context.Cluster
+	clusterName := kubeconfigCtx.Cluster
 	...
 }
```

And update the call site in `createHostedClusterClient`:

```diff
-	if err := replaceServerWithInternalEndpoint(ctx, cm.mgmtClient, kubeconfig, namespace, name); err != nil {
+	if err := replaceServerWithInternalEndpoint(kubeconfig, namespace, name); err != nil {
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/controller/csrapproval/client.go` around lines 159 - 191, The function replaceServerWithInternalEndpoint currently accepts unused parameters ctx and mgmtClient and declares a local variable named context that shadows the imported context package; remove the unused parameters from replaceServerWithInternalEndpoint's signature and update its callers (e.g., createHostedClusterClient) to match the new signature, and rename the local variable context (e.g., to kubeconfigCtx or ctxEntry) where kubeconfig.Contexts[currentContext] is read to avoid shadowing the standard library package.

api/v1alpha1/dpfhcpprovisioner_types.go (1)
238-239: Consider a more descriptive value for `ReasonCSRApprovalActive`.

The value `"Active"` is ambiguous in isolation — `kubectl get dpfhcp -o yaml` will show only the `Reason` field without the condition type context. Every other reason constant in this file uses a self-describing string (e.g., `"AllComponentsOperational"`, `"HostedClusterNotReady"`). A value like `"CSRApprovalActive"` or `"ApprovalControllerRunning"` would be consistent with that convention.

✏️ Proposed rename

```diff
-	ReasonCSRApprovalActive string = "Active"
+	ReasonCSRApprovalActive string = "CSRApprovalActive"
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@api/v1alpha1/dpfhcpprovisioner_types.go` around lines 238 - 239, The constant ReasonCSRApprovalActive currently uses the ambiguous value "Active"; change its string value to a more descriptive identifier such as "CSRApprovalActive" (i.e., keep the constant name ReasonCSRApprovalActive but set its value to "CSRApprovalActive"), and update any usages/tests/fixtures that assert on the prior "Active" literal (search for ReasonCSRApprovalActive and any string comparisons to "Active") so output in kubectl shows the self-describing reason.

internal/controller/csrapproval/csrapproval.go (1)
264-266: `NewValidator` allocated on every CSR — consider hoisting to `processPendingCSRs`.

All three constructor arguments (`a.mgmtClient`, `hcClient`, `dpfhcp.Spec.DPUClusterRef.Namespace`) are invariant across the entire CSR batch. Creating one `Validator` per CSR is functionally harmless but wasteful. Moving creation to `processPendingCSRs` and threading it into `processCSR` (or making it a field) would be cleaner.

♻️ Proposed refactor
```diff
-// processCSR processes a single CSR (validate and approve if valid)
-func (a *CSRApprover) processCSR(ctx context.Context, dpfhcp *provisioningv1alpha1.DPFHCPProvisioner, hcClient *kubernetes.Clientset, csr *certv1.CertificateSigningRequest, csrType string) error {
+// processCSR processes a single CSR (validate and approve if valid)
+func (a *CSRApprover) processCSR(ctx context.Context, validator *Validator, hcClient *kubernetes.Clientset, csr *certv1.CertificateSigningRequest, csrType string, dpfhcp *provisioningv1alpha1.DPFHCPProvisioner) error {
```

And in `processPendingCSRs`, create the validator once before the loops:

```diff
+	validator := NewValidator(a.mgmtClient, hcClient, dpfhcp.Spec.DPUClusterRef.Namespace)
+
 	// Process bootstrap CSRs
 	for _, csr := range bootstrapCSRs {
-		if err := a.processCSR(ctx, dpfhcp, hcClient, &csr, "bootstrap"); err != nil {
+		if err := a.processCSR(ctx, validator, hcClient, &csr, "bootstrap", dpfhcp); err != nil {
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/controller/csrapproval/csrapproval.go` around lines 264 - 266, NewValidator is being allocated per CSR; instead, create the Validator once in processPendingCSRs using the invariant arguments (a.mgmtClient, hcClient, dpfhcp.Spec.DPUClusterRef.Namespace) and pass that single instance into processCSR (or store it as a field on the approver struct) so processCSR calls validator.ValidateCSROwner(ctx, hostname, isBootstrapCSR) without re-allocating; update function signatures (processCSR) or the approver struct accordingly and remove the per-CSR NewValidator call.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@internal/controller/csrapproval/client.go`:
- Around line 31-39: ClientManager's hcClients map is accessed concurrently and
must be protected: add a sync.RWMutex field (e.g., mu sync.RWMutex) to the
ClientManager struct and update all places that read or write hcClients — use
mu.RLock()/RUnlock() around reads in GetHostedClusterClient and any cache lookup
paths, and use mu.Lock()/Unlock() around writes such as creating or deleting
entries in InvalidateClient and map initialization (ensure hcClients is
lazy-initialized under the write lock). Also scan for any other methods
referencing hcClients and wrap their map accesses with the appropriate lock to
eliminate the data race.
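The locking discipline described above can be sketched with a self-contained cache; the string value type stands in for `*kubernetes.Clientset`, and the method names (`get`, `put`, `invalidate`) are illustrative rather than the repository's actual API:

```go
package main

import (
	"fmt"
	"sync"
)

// clientCache sketches an RWMutex-protected map like ClientManager.hcClients.
type clientCache struct {
	mu      sync.RWMutex
	clients map[string]string // stand-in for map[string]*kubernetes.Clientset
}

// get takes the read lock for cache lookups.
func (c *clientCache) get(key string) (string, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	v, ok := c.clients[key]
	return v, ok
}

// put takes the write lock and lazy-initializes the map under it.
func (c *clientCache) put(key, v string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.clients == nil {
		c.clients = make(map[string]string)
	}
	c.clients[key] = v
}

// invalidate takes the write lock to delete an entry.
func (c *clientCache) invalidate(key string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.clients, key)
}

func main() {
	cache := &clientCache{}
	var wg sync.WaitGroup
	// Concurrent readers and writers are safe under the mutex (passes -race).
	for i := 0; i < 50; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			cache.put("ns/name", fmt.Sprintf("client-%d", i))
			cache.get("ns/name")
		}(i)
	}
	wg.Wait()
	_, ok := cache.get("ns/name")
	fmt.Println(ok) // true
}
```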
- Around line 194-200: TestConnection currently ignores the passed ctx because
Discovery().ServerVersion() is not context-aware; change TestConnection to
perform a context-bound REST GET against the API /version using the discovery
REST client (e.g.
clientset.Discovery().RESTClient().Get().AbsPath("/version").Do(ctx).Into(&version.Info{}))
so the request respects ctx (or alternatively ensure the client REST config has
a non-zero Timeout). Update the function TestConnection to decode the response
into a version.Info and return any error from the Do(ctx) call instead of
calling ServerVersion().
- Around line 105-108: The client config currently sets config.Timeout = 0 which
allows List/Update calls to block forever; change this to a finite timeout (e.g.
10–30s) so CSR polling and approve operations fail fast when the hosted API is
unresponsive. Update the code that builds the rest.Config (the config variable
in internal/controller/csrapproval/client.go) to set a reasonable non-zero
timeout and ensure the clientset creation (where this config is used) inherits
that timeout so List/Update calls in the reconcile loop (30-second loop) won't
hang indefinitely.
In `@internal/controller/csrapproval/csr_content_validation.go`:
- Around line 275-295: The IP validation helper hasValidIPAddress is unused
while validateServingIdentities currently only checks for nil and
IsUnspecified(); update the certificate-serving identity loop that iterates
certReq.IPAddresses in validateServingIdentities to call hasValidIPAddress(ip)
and return an error when it returns false (including the offending IP in the
message), or if you prefer remove hasValidIPAddress and consolidate its stricter
checks (IsUnspecified, IsLoopback, IsMulticast) directly into
validateServingIdentities so the IP validation logic is consistent and not
duplicated.
In `@internal/controller/csrapproval/csr_owner_validation.go`:
- Around line 76-83: The current isBootstrapCSR path rejects all bootstrap CSRs
when nodeExists is true; instead, change the logic in csr_owner_validation.go
(the isBootstrapCSR branch that returns ValidationResult) to only reject when
the existing Node already has an active/valid kubelet client certificate, and
allow the bootstrap CSR when the Node exists but does not have such a valid
client cert (i.e., implement a helper like isNodeHasValidKubeletCert(node) and
call it before returning the ValidationResult). Keep the same ValidationResult
shape but replace the plain nodeExists check with a call to the helper so
legitimate re-bootstrap/cert-recovery CSRs are accepted while nodes that already
have active kubelet certs are still blocked.
In `@internal/controller/csrapproval/csrapproval.go`:
- Around line 89-98: When TestConnection(ctx, hcClient) returns an error inside
the reconcile block in csrapproval.go, call
a.clientManager.InvalidateClient(...) for the same HostedCluster reference
before setting condition/recording event/returning so the stale cached client is
cleared; update the error branch that currently logs the error and calls
a.setCondition / a.recorder.Event to first invalidate the cached client obtained
via GetHostedClusterClient (use the same cluster identifier used to retrieve the
client) and then proceed with setCondition and Event and return.
- Around line 208-228: approvedCount is over-counting because processCSR
currently returns nil for both skipped and approved CSRs; change processCSR
signature to return (bool, error) where the bool indicates whether the CSR was
actually approved, update all call sites in csrapproval.go (the two loops
processing bootstrapCSRs and servingCSRs) to capture (approved, err) and only
increment approvedCount when approved==true, propagate/log err when non-nil, and
ensure all internal returns in processCSR return (false, nil) for
validation-skipped CSRs and (true, nil) for successfully approved CSRs so
pendingRelevantCSRs and the status message reflect true approvals.
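The `(bool, error)` contract described above can be sketched with a minimal stand-in `csr` type; the field names and validation are hypothetical, but the counting discipline is the point — only a `true` first return increments `approvedCount`:

```go
package main

import "fmt"

// csr is a hypothetical stand-in for certv1.CertificateSigningRequest.
type csr struct {
	name  string
	valid bool
}

// processCSR returns (approved, err): (false, nil) for validation-skipped CSRs,
// (true, nil) only when the CSR was actually approved.
func processCSR(c csr) (bool, error) {
	if !c.valid {
		// Skipped by validation: not an error, but not an approval either.
		return false, nil
	}
	// ... the approval API call would go here ...
	return true, nil
}

func main() {
	pending := []csr{{"dpu-1", true}, {"dpu-2", false}, {"dpu-3", true}}
	approvedCount := 0
	for _, c := range pending {
		approved, err := processCSR(c)
		if err != nil {
			fmt.Println("error processing", c.name, err)
			continue
		}
		if approved {
			approvedCount++
		}
	}
	fmt.Println(approvedCount) // 2
}
```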
---
Nitpick comments:
In `@api/v1alpha1/dpfhcpprovisioner_types.go`:
- Around line 238-239: The constant ReasonCSRApprovalActive currently uses the
ambiguous value "Active"; change its string value to a more descriptive
identifier such as "CSRApprovalActive" (i.e., keep the constant name
ReasonCSRApprovalActive but set its value to "CSRApprovalActive"), and update
any usages/tests/fixtures that assert on the prior "Active" literal (search for
ReasonCSRApprovalActive and any string comparisons to "Active") so output in
kubectl shows the self-describing reason.
In `@internal/controller/csrapproval/client.go`:
- Around line 159-191: The function replaceServerWithInternalEndpoint currently
accepts unused parameters ctx and mgmtClient and declares a local variable named
context that shadows the imported context package; remove the unused parameters
from replaceServerWithInternalEndpoint's signature and update its callers (e.g.,
createHostedClusterClient) to match the new signature, and rename the local
variable context (e.g., to kubeconfigCtx or ctxEntry) where
kubeconfig.Contexts[currentContext] is read to avoid shadowing the standard
library package.
In `@internal/controller/csrapproval/csr_owner_validation.go`:
- Around line 88-100: Remove the unnecessary else block following the early
return: in the function that constructs a ValidationResult using variables
nodeExists and hostname (the block that currently returns fmt.Sprintf("node %s
does not exist yet in hosted cluster", hostname) inside an else), delete the
else and outdent its contents so the nodeExists check and the two
ValidationResult returns (the "node ... does not exist yet in hosted cluster"
and "DPU exists and node already joined") become the natural fall-through after
the prior return; keep the same return values and use the same ValidationResult
struct.
- Around line 56-101: Add unit tests for ValidateCSROwner covering the six
scenarios listed: DPU missing returns Valid=false with the expected Reason;
bootstrap CSR with node absent returns Valid=true; bootstrap CSR with node
present returns Valid=false; serving CSR with node present returns Valid=true;
serving CSR with node absent returns Valid=false; and both dpuExists/nodeExists
returning errors propagate the error. Create csr_owner_validation_test.go and
use controller-runtime's fake client to control DPU existence and a fake
kubernetes.Interface (or a client stub) to control nodeExists behavior; call
ValidateCSROwner and assert the ValidationResult.Valid and Reason values for
each case and that errors from dpuExists/nodeExists are returned (not
swallowed). Ensure tests reference the Validator type and its methods dpuExists
and nodeExists (mock/stub them or inject a test-friendly interface) so behavior
is deterministic.
- Around line 104-133: Replace the full-list scans in dpuExists and nodeExists
with direct Get calls: import k8s apierrors and call v.mgmtClient.Get(ctx,
client.ObjectKey{Namespace: v.dpuNamespace, Name: hostname},
&dpuprovisioningv1alpha1.DPU{}) in dpuExists and use
v.hostedClient.CoreV1().Nodes().Get(ctx, hostname, metav1.GetOptions{}) in
nodeExists; treat apierrors.IsNotFound as a false/no-error result and return
other errors upward so you avoid O(n) list operations and return correct
boolean+error behavior for both helpers.
In `@internal/controller/csrapproval/csrapproval.go`:
- Around line 264-266: NewValidator is being allocated per CSR; instead, create
the Validator once in processPendingCSRs using the invariant arguments
(a.mgmtClient, hcClient, dpfhcp.Spec.DPUClusterRef.Namespace) and pass that
single instance into processCSR (or store it as a field on the approver struct)
so processCSR calls validator.ValidateCSROwner(ctx, hostname, isBootstrapCSR)
without re-allocating; update function signatures (processCSR) or the approver
struct accordingly and remove the per-CSR NewValidator call.
61aaf40 to 9454448
🧹 Nitpick comments (6)
internal/controller/csrapproval/csrapproval.go (2)
148-220: `processPendingCSRs` swallows individual CSR processing errors — consider aggregating.

Lines 202-206 and 212-216 log errors from `processCSR` but continue and return `nil` from `processPendingCSRs`. If all CSRs fail, the caller at Line 104 sees no error and reports "CSR auto-approval active" — which is technically true but masks systemic issues. Consider returning an aggregated error (or a count of failures) so the condition message reflects the true state.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/controller/csrapproval/csrapproval.go` around lines 148 - 220, processPendingCSRs currently logs per-CSR errors from processCSR but always returns nil, masking systemic failures; modify processPendingCSRs to track failures (e.g., failureCount or collect error messages) while iterating over bootstrapCSRs and servingCSRs and after both loops return a non-nil aggregated error (or fmt.Errorf with the failure count and brief details) when any CSR processing failed; ensure you update the return to include context (reference processPendingCSRs and processCSR) and keep per-CSR log calls but also surface a summary error to the caller so the caller can detect and report when auto-approval failed for all/part of the CSRs.
248-262: `NewValidator` is instantiated per-CSR; hoist it out of the loop.

`mgmtClient`, `hcClient`, and `dpuNamespace` don't change between CSRs in the same poll cycle. Creating a `Validator` (and its embedded clients) for every CSR is wasteful. Create it once in `processPendingCSRs` and pass it into `processCSR`.

♻️ Proposed fix

In `processPendingCSRs`, create the validator once and pass it to `processCSR`:

```diff
 func (a *CSRApprover) processPendingCSRs(ctx context.Context, dpfhcp *provisioningv1alpha1.DPFHCPProvisioner, hcClient *kubernetes.Clientset) error {
 	log := logf.FromContext(ctx)
+	validator := NewValidator(a.mgmtClient, hcClient, dpfhcp.Spec.DPUClusterRef.Namespace)

 	// ... (existing code) ...

 	for _, csr := range bootstrapCSRs {
-		_, err := a.processCSR(ctx, dpfhcp, hcClient, &csr, "bootstrap")
+		_, err := a.processCSR(ctx, dpfhcp, hcClient, &csr, "bootstrap", validator)
```

Then in `processCSR`, accept and use the pre-built validator:

```diff
-func (a *CSRApprover) processCSR(ctx context.Context, dpfhcp *provisioningv1alpha1.DPFHCPProvisioner, hcClient *kubernetes.Clientset, csr *certv1.CertificateSigningRequest, csrType string) (bool, error) {
+func (a *CSRApprover) processCSR(ctx context.Context, dpfhcp *provisioningv1alpha1.DPFHCPProvisioner, hcClient *kubernetes.Clientset, csr *certv1.CertificateSigningRequest, csrType string, validator *Validator) (bool, error) {
 	// ...
-	validator := NewValidator(a.mgmtClient, hcClient, dpfhcp.Spec.DPUClusterRef.Namespace)
 	isBootstrapCSR := csr.Spec.SignerName == SignerBootstrap
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/controller/csrapproval/csrapproval.go` around lines 248 - 262, NewValidator is being created inside processCSR for every CSR which is wasteful because mgmtClient, hcClient and dpfhcp.Spec.DPUClusterRef.Namespace are constant per poll; instead instantiate validator once in processPendingCSRs (using mgmtClient, hcClient and the DPU namespace), then change processCSR to accept a Validator parameter and remove the NewValidator call inside processCSR; update all calls to processCSR to pass the prebuilt validator and ensure processCSR uses the injected Validator (ValidateCSROwner) rather than creating its own.

internal/controller/csrapproval/csr_content_validation.go (1)
108-166: Minor style inconsistency between `validateBootstrapperIdentity` and `validateNodeIdentity` group-checking logic.

`validateBootstrapperIdentity` uses a `map[string]bool` pattern to check required groups (Lines 116-133), while `validateNodeIdentity` uses individual boolean flags (Lines 147-163). Both are correct, but the inconsistency slightly hurts readability. Consider unifying to one pattern.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/controller/csrapproval/csr_content_validation.go` around lines 108 - 166, Unify the group-checking pattern by changing validateNodeIdentity to use the same map[string]bool approach as validateBootstrapperIdentity: keep the username check (expectedUsername := SystemNodePrefix + hostname) but replace the hasNodesGroup/hasAuthenticatedGroup booleans with a requiredGroups map initialized with groupNodes and groupAuthenticated set to false, iterate csr.Spec.Groups to mark found entries, then loop over requiredGroups to return an error if any remain false; reference validateNodeIdentity, validateBootstrapperIdentity, groupNodes and groupAuthenticated when making the change.

internal/controller/csrapproval/client.go (2)
56-79: TOCTOU in `GetHostedClusterClient` — benign but may cause duplicate client creation.

Two concurrent goroutines can both miss the read-lock cache check (Lines 60-65), both create a client (Line 68), and both insert into the map (Lines 74-76) — the second write silently overwrites the first. The discarded client leaks its underlying transport connections until GC.
This is a minor concern since it only happens on the first access for a given key and the impact is a single wasted client creation. If you want to tighten it, a double-check after acquiring the write lock would suffice.
♻️ Double-check pattern
```diff
 	// Cache the client with write lock
 	cm.mu.Lock()
+	// Double-check: another goroutine may have inserted while we were creating
+	if existing, ok := cm.hcClients[key]; ok {
+		cm.mu.Unlock()
+		return existing, nil
+	}
 	cm.hcClients[key] = clientset
 	cm.mu.Unlock()
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/controller/csrapproval/client.go` around lines 56 - 79, GetHostedClusterClient has a TOCTOU where two goroutines can both create a client and one overwrites the other; after creating the new client via createHostedClusterClient, acquire cm.mu.Lock() and re-check cm.hcClients[key] — if an entry now exists, release the lock and return the existing client, otherwise store the newly created client and return it; if you need to avoid leaking the discarded client's connections, call a small cleanup helper (e.g., closeClient(newClient)) when you detect an existing entry.
172-205: Remove unused `ctx` and `mgmtClient` parameters from `replaceServerWithInternalEndpoint`.

The function signature accepts `ctx context.Context` and `mgmtClient client.Client` but neither is used in the function body. Removing them reduces noise and eliminates confusion about unnecessary dependencies.

♻️ Proposed fix

```diff
-func replaceServerWithInternalEndpoint(ctx context.Context, mgmtClient client.Client, kubeconfig *clientcmdapi.Config, hostedClusterNamespace, hostedClusterName string) error {
+func replaceServerWithInternalEndpoint(kubeconfig *clientcmdapi.Config, hostedClusterNamespace, hostedClusterName string) error {
```

Update the call site at line 107:

```diff
-	if err := replaceServerWithInternalEndpoint(ctx, cm.mgmtClient, kubeconfig, namespace, name); err != nil {
+	if err := replaceServerWithInternalEndpoint(kubeconfig, namespace, name); err != nil {
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/controller/csrapproval/client.go` around lines 172 - 205, The function replaceServerWithInternalEndpoint currently declares unused parameters ctx and mgmtClient; remove these parameters from the signature so it becomes replaceServerWithInternalEndpoint(kubeconfig *clientcmdapi.Config, hostedClusterNamespace, hostedClusterName string) error, update the function body unchanged, and then update every call site that invokes replaceServerWithInternalEndpoint to stop passing a context and client (pass only the kubeconfig, hostedClusterNamespace and hostedClusterName). Also adjust any unit tests or helper wrappers that reference this function signature to match the new parameter list.

internal/controller/csrapproval/csr_owner_validation.go (1)
98-128: Replace `List` + iteration with direct `Get` lookups for better performance.

Both helpers list all objects then iterate to find by name. Use `client.Get()` for `dpuExists` and `Nodes().Get()` for `nodeExists` instead — these are O(1) indexed lookups at the API server versus linear scans, which matters when processing multiple CSRs per poll cycle in clusters with many DPUs/nodes.

♻️ Proposed fix

Add `apierrors` to imports:

```go
import (
	"context"
	"fmt"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"sigs.k8s.io/controller-runtime/pkg/client"

	dpuprovisioningv1alpha1 "github.com/nvidia/doca-platform/api/provisioning/v1alpha1"
)
```

Then update the functions:

```diff
 func (v *Validator) dpuExists(ctx context.Context, hostname string) (bool, error) {
-	dpuList := &dpuprovisioningv1alpha1.DPUList{}
-	if err := v.mgmtClient.List(ctx, dpuList, client.InNamespace(v.dpuNamespace)); err != nil {
-		return false, err
-	}
-
-	for _, dpu := range dpuList.Items {
-		if dpu.Name == hostname {
-			return true, nil
-		}
+	dpu := &dpuprovisioningv1alpha1.DPU{}
+	err := v.mgmtClient.Get(ctx, client.ObjectKey{Namespace: v.dpuNamespace, Name: hostname}, dpu)
+	if err != nil {
+		if client.IgnoreNotFound(err) == nil {
+			return false, nil
+		}
+		return false, err
 	}
-
-	return false, nil
+	return true, nil
 }

 func (v *Validator) nodeExists(ctx context.Context, hostname string) (bool, error) {
-	nodeList, err := v.hostedClient.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
+	_, err := v.hostedClient.CoreV1().Nodes().Get(ctx, hostname, metav1.GetOptions{})
 	if err != nil {
+		if apierrors.IsNotFound(err) {
+			return false, nil
+		}
 		return false, err
 	}
-
-	for _, node := range nodeList.Items {
-		if node.Name == hostname {
-			return true, nil
-		}
-	}
-
-	return false, nil
+	return true, nil
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/controller/csrapproval/csr_owner_validation.go` around lines 98 - 128, Replace the inefficient list+iterate logic in dpuExists and nodeExists with direct GETs: use v.mgmtClient.Get(ctx, types.NamespacedName{Name: hostname, Namespace: v.dpuNamespace}, &dpuprovisioningv1alpha1.DPU{}) in dpuExists and v.hostedClient.CoreV1().Nodes().Get(ctx, hostname, metav1.GetOptions{}) in nodeExists, handle not-found using apierrors.IsNotFound to return (false, nil) and return other errors unchanged; add the apierrors import (k8s.io/apimachinery/pkg/api/errors) and reference the existing v.mgmtClient, v.dpuNamespace, and v.hostedClient symbols when making these changes.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@internal/controller/csrapproval/client.go`:
- Around line 33-53: The review metadata contains a duplicate review marker
([duplicate_comment]) and redundant approval tokens; remove the duplicate
comment/approval tag(s) so there's a single clear approval entry for this change
and no duplicated review lines, leaving the code as-is (ClientManager and
NewClientManager require no further changes).
In `@internal/controller/csrapproval/csr_content_validation.go`:
- Around line 260-292: The CSR IP validation previously duplicated logic; update
validateServingIdentities to call hasValidIPAddress(certReq.IPAddresses...) by
replacing any inline IP checks with calls to hasValidIPAddress (function:
hasValidIPAddress) and keep the existing hostname check in
validateServingIdentities; ensure hasValidIPAddress continues to reject
unspecified, loopback, and multicast addresses and return errors from
validateServingIdentities on any invalid IPs so the change is fully integrated
and tests cover both DNS name (hostname) and IP validation.
In `@internal/controller/csrapproval/csr_owner_validation.go`:
- Around line 52-96: ValidateCSROwner currently implements the intended
behavior: call v.dpuExists(ctx, hostname) and if found allow bootstrap CSRs
(isBootstrapCSR true) regardless of node state, otherwise for serving CSRs call
v.nodeExists(ctx, hostname) and require the node to exist; keep this logic
as-is, ensure the error wraps from dpuExists/nodeExists remain
(fmt.Errorf("failed to check DPU existence: %w", err) and fmt.Errorf("failed to
check Node existence: %w", err)), preserve the ValidationResult messages and use
v.dpuNamespace in the "no matching DPU" reason string to maintain informative
diagnostics.
In `@internal/controller/csrapproval/csrapproval.go`:
- Around line 88-101: On TestConnection failure, ensure you invalidate the
cached hosted-cluster client by calling
a.clientManager.InvalidateClient(dpfhcp.Namespace, dpfhcp.Name) before setting
status/recording events so the next reconcile creates a fresh client; then call
a.setCondition(ctx, dpfhcp, metav1.ConditionFalse,
provisioningv1alpha1.ReasonHostedClusterNotReachable, fmt.Sprintf("Cannot
connect to hosted cluster: %v", err)) and a.recorder.Event(dpfhcp,
corev1.EventTypeWarning, "HostedClusterUnreachable", fmt.Sprintf("Cannot connect
to hosted cluster: %v", err)), and return ctrl.Result{}, nil (all within the
TestConnection error branch that currently logs via log.Error(err, "Hosted
cluster not reachable")).
- Around line 293-316: setCondition now correctly uses meta.SetStatusCondition
to avoid duplicate conditions; keep this implementation and the event-on-change
logic (in setCondition) as-is: only emit events via a.recorder.Event when
meta.SetStatusCondition(&dpfhcp.Status.Conditions, condition) returns true, and
only call a.mgmtClient.Status().Update(ctx, dpfhcp) in that branch; verify
references to setCondition, meta.SetStatusCondition, a.recorder.Event, and
a.mgmtClient.Status().Update are preserved and not reverted.
---
Nitpick comments:
In `@internal/controller/csrapproval/client.go`:
- Around line 56-79: GetHostedClusterClient has a TOCTOU where two goroutines
can both create a client and one overwrites the other; after creating the new
client via createHostedClusterClient, acquire cm.mu.Lock() and re-check
cm.hcClients[key] — if an entry now exists, release the lock and return the
existing client, otherwise store the newly created client and return it; if you
need to avoid leaking the discarded client's connections, call a small cleanup
helper (e.g., closeClient(newClient)) when you detect an existing entry.
- Around line 172-205: The function replaceServerWithInternalEndpoint currently
declares unused parameters ctx and mgmtClient; remove these parameters from the
signature so it becomes replaceServerWithInternalEndpoint(kubeconfig
*clientcmdapi.Config, hostedClusterNamespace, hostedClusterName string) error,
update the function body unchanged, and then update every call site that invokes
replaceServerWithInternalEndpoint to stop passing a context and client (pass
only the kubeconfig, hostedClusterNamespace and hostedClusterName). Also adjust
any unit tests or helper wrappers that reference this function signature to
match the new parameter list.
In `@internal/controller/csrapproval/csr_content_validation.go`:
- Around line 108-166: Unify the group-checking pattern by changing
validateNodeIdentity to use the same map[string]bool approach as
validateBootstrapperIdentity: keep the username check (expectedUsername :=
SystemNodePrefix + hostname) but replace the hasNodesGroup/hasAuthenticatedGroup
booleans with a requiredGroups map initialized with groupNodes and
groupAuthenticated set to false, iterate csr.Spec.Groups to mark found entries,
then loop over requiredGroups to return an error if any remain false; reference
validateNodeIdentity, validateBootstrapperIdentity, groupNodes and
groupAuthenticated when making the change.
In `@internal/controller/csrapproval/csr_owner_validation.go`:
- Around line 98-128: Replace the inefficient list+iterate logic in dpuExists
and nodeExists with direct GETs: use v.mgmtClient.Get(ctx,
types.NamespacedName{Name: hostname, Namespace: v.dpuNamespace},
&dpuprovisioningv1alpha1.DPU{}) in dpuExists and
v.hostedClient.CoreV1().Nodes().Get(ctx, hostname, metav1.GetOptions{}) in
nodeExists, handle not-found using apierrors.IsNotFound to return (false, nil)
and return other errors unchanged; add the apierrors import
(k8s.io/apimachinery/pkg/api/errors) and reference the existing v.mgmtClient,
v.dpuNamespace, and v.hostedClient symbols when making these changes.
In `@internal/controller/csrapproval/csrapproval.go`:
- Around line 148-220: processPendingCSRs currently logs per-CSR errors from
processCSR but always returns nil, masking systemic failures; modify
processPendingCSRs to track failures (e.g., failureCount or collect error
messages) while iterating over bootstrapCSRs and servingCSRs and after both
loops return a non-nil aggregated error (or fmt.Errorf with the failure count
and brief details) when any CSR processing failed; ensure you update the return
to include context (reference processPendingCSRs and processCSR) and keep
per-CSR log calls but also surface a summary error to the caller so the caller
can detect and report when auto-approval failed for all/part of the CSRs.
- Around line 248-262: NewValidator is being created inside processCSR for every
CSR which is wasteful because mgmtClient, hcClient and
dpfhcp.Spec.DPUClusterRef.Namespace are constant per poll; instead instantiate
validator once in processPendingCSRs (using mgmtClient, hcClient and the DPU
namespace), then change processCSR to accept a Validator parameter and remove
the NewValidator call inside processCSR; update all calls to processCSR to pass
the prebuilt validator and ensure processCSR uses the injected Validator
(ValidateCSROwner) rather than creating its own.
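The error-aggregation change suggested for processPendingCSRs can be sketched in isolation with Go 1.20's `errors.Join`. This is a minimal stand-alone sketch, not the PR's code: `processOne` is a hypothetical stand-in for `processCSR`, and the per-item logging mentioned in the prompt is reduced to a comment.

```go
package main

import (
	"errors"
	"fmt"
)

// processOne stands in for per-item work such as processing a single CSR.
// It is a hypothetical helper used only for this sketch.
func processOne(name string) error {
	if name == "bad" {
		return fmt.Errorf("processing %q failed", name)
	}
	return nil
}

// processAll keeps iterating after individual failures, but surfaces an
// aggregated error to the caller so systemic failures are not masked.
func processAll(names []string) error {
	var errs []error
	for _, n := range names {
		if err := processOne(n); err != nil {
			// In a controller, this is where the per-item log call stays.
			errs = append(errs, err)
		}
	}
	if len(errs) > 0 {
		return fmt.Errorf("%d of %d items failed: %w", len(errs), len(names), errors.Join(errs...))
	}
	return nil
}

func main() {
	fmt.Println(processAll([]string{"ok", "bad", "ok"}) != nil) // true: the failure is surfaced
	fmt.Println(processAll([]string{"ok"}) == nil)              // true: all-clear returns nil
}
```

The caller can then decide whether a partial failure warrants a condition change or just a requeue.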
9454448 to ac343f3
Actionable comments posted: 7
🧹 Nitpick comments (8)
internal/controller/csrapproval/csr_content_validation_test.go (2)
180-327: Add a test case for `ValidateServingCSR` with an invalid organization (step 4 is untested).

`ValidateServingCSR` calls `validateOrganization(certReq)` (step 4), but all five `ValidateServingCSR` test cases use a correct `"system:nodes"` organization. An invalid organization would slip through undetected if the validation were regressed.

✅ Proposed additional test case
```diff
+	Context("When CSR has invalid organization", func() {
+		It("should fail validation", func() {
+			hostname := "dpu-worker-01"
+			// DNS helper still uses the right hostname, but org is wrong
+			csrBytes := createServingCSRWithDNS("system:node:"+hostname, "wrong-org", hostname)
+
+			csr := &certv1.CertificateSigningRequest{
+				ObjectMeta: metav1.ObjectMeta{
+					Name: "test-serving-invalid-org",
+				},
+				Spec: certv1.CertificateSigningRequestSpec{
+					SignerName: SignerServing,
+					Username:   "system:node:" + hostname,
+					Groups: []string{
+						groupNodes,
+						groupAuthenticated,
+					},
+					Usages: []certv1.KeyUsage{
+						certv1.UsageDigitalSignature,
+						certv1.UsageServerAuth,
+					},
+					Request: csrBytes,
+				},
+			}
+
+			err := ValidateServingCSR(csr, hostname)
+			Expect(err).To(HaveOccurred())
+			Expect(err.Error()).To(ContainSubstring("invalid organization"))
+		})
+	})
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/controller/csrapproval/csr_content_validation_test.go` around lines 180 - 327, The tests for ValidateServingCSR are missing a case that asserts validateOrganization(certReq) fails when the CSR organization is not "system:nodes"; add a new spec similar to the other Context blocks that builds a CSR with createServingCSRWithDNS using a non-matching organization (e.g. "wrong-org"), sets Spec.Username to "system:node:"+hostname and Groups to include/not include as needed, calls ValidateServingCSR(csr, hostname) and expects an error containing a message about invalid organization (or the exact substring produced by validateOrganization); use a descriptive test name like "When CSR has invalid organization" to locate it alongside the other Contexts and reference ValidateServingCSR and validateOrganization in the test for clarity.
27-177: Add a test case for a `ValidateBootstrapCSR` CN/hostname mismatch (step 5 is untested).

The implementation's step 5 — `validateCN(certReq, hostname)` — has no corresponding failure test. Every other validation step (identity, usages, org) has a negative test, but a CSR whose CN does not match the expected hostname will silently pass in this test suite if the CN validation were ever broken.

✅ Proposed additional test case
```diff
+	Context("When CSR has CN that does not match the hostname", func() {
+		It("should fail validation", func() {
+			hostname := "dpu-worker-01"
+			// Create CSR bytes with a different CN
+			csrBytes := createTestCSRWithOrganization("system:node:other-node", "system:nodes")
+
+			csr := &certv1.CertificateSigningRequest{
+				ObjectMeta: metav1.ObjectMeta{
+					Name: "test-cn-mismatch",
+				},
+				Spec: certv1.CertificateSigningRequestSpec{
+					SignerName: SignerBootstrap,
+					Username:   BootstrapperUsername,
+					Groups: []string{
+						groupServiceAccounts,
+						groupServiceAccountsMachineConfig,
+						groupAuthenticated,
+					},
+					Usages: []certv1.KeyUsage{
+						certv1.UsageDigitalSignature,
+						certv1.UsageClientAuth,
+					},
+					Request: csrBytes,
+				},
+			}
+
+			err := ValidateBootstrapCSR(csr, hostname)
+			Expect(err).To(HaveOccurred())
+			Expect(err.Error()).To(ContainSubstring("CN validation failed"))
+		})
+	})
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/controller/csrapproval/csr_content_validation_test.go` around lines 27 - 177, Add a negative test that covers step 5 (CN/hostname validation) by creating a CSR whose organization and other fields are valid but whose common name (CN) does not match the provided hostname; call ValidateBootstrapCSR(csr, hostname) and assert it returns an error and that err.Error() contains a substring indicating the CN/hostname mismatch (e.g., "CN" or "common name"); place this new test as a Context similar to the other negative cases and reuse helpers like createTestCSRWithOrganization and constants such as SignerBootstrap and BootstrapperUsername so it exercises validateCN(certReq, hostname).

internal/controller/csrapproval/hostname_test.go (1)
296-390: Consider extracting common CSR generation logic to reduce duplication.

The three helper functions (`createTestCSR`, `createTestCSRWithOrganization`, `createServingCSRWithDNS`) share nearly identical key generation, CSR creation, and PEM encoding logic. A single parameterized helper would reduce the ~90 lines of duplication.

♻️ Example consolidation
```diff
+// createTestCSRRaw creates a test CSR with the given template
+func createTestCSRRaw(template x509.CertificateRequest) []byte {
+	privateKey, err := rsa.GenerateKey(rand.Reader, 2048)
+	if err != nil {
+		panic(err)
+	}
+	csrBytes, err := x509.CreateCertificateRequest(rand.Reader, &template, privateKey)
+	if err != nil {
+		panic(err)
+	}
+	return pem.EncodeToMemory(&pem.Block{Type: "CERTIFICATE REQUEST", Bytes: csrBytes})
+}
+
 func createTestCSR(cn string) []byte {
-	privateKey, err := rsa.GenerateKey(rand.Reader, 2048)
-	if err != nil {
-		panic(err)
-	}
-	template := x509.CertificateRequest{
+	return createTestCSRRaw(x509.CertificateRequest{
 		Subject: pkix.Name{CommonName: cn},
-	}
-	csrBytes, err := x509.CreateCertificateRequest(rand.Reader, &template, privateKey)
-	if err != nil {
-		panic(err)
-	}
-	pemBlock := pem.EncodeToMemory(&pem.Block{Type: "CERTIFICATE REQUEST", Bytes: csrBytes})
-	return pemBlock
+	})
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/controller/csrapproval/hostname_test.go` around lines 296 - 390, The three helpers createTestCSR, createTestCSRWithOrganization, and createServingCSRWithDNS duplicate key generation, x509.CreateCertificateRequest and PEM encoding; consolidate them by adding a single helper (e.g., new createCSR helper) that takes parameters for commonName (cn string), organizations ([]string) and dnsNames ([]string) and performs rsa.GenerateKey, x509.CreateCertificateRequest and pem.EncodeToMemory once, returning []byte; then change createTestCSR, createTestCSRWithOrganization and createServingCSRWithDNS to call createCSR with the appropriate args (nil/empty slices where not needed) so the duplicated logic is removed while keeping the same panic-on-error behavior for tests.

internal/controller/csrapproval/client.go (2)
56-79: Minor TOCTOU in double-checked locking — two goroutines can create duplicate clients for the same key.

Between `RUnlock()` at Line 65 and `Lock()` at Line 74, a second goroutine can also miss the cache and create a redundant client. The extra client is silently overwritten and GC'd, so this isn't a correctness issue, but it wastes a kubeconfig parse + TCP connection setup.

A simple fix is to re-check the map after acquiring the write lock:
♻️ Proposed fix
```diff
 	// Create new client (outside lock to avoid holding lock during slow operation)
 	clientset, err := cm.createHostedClusterClient(ctx, namespace, name)
 	if err != nil {
 		return nil, err
 	}

 	// Cache the client with write lock
 	cm.mu.Lock()
+	// Re-check: another goroutine may have populated the cache while we were creating
+	if existing, ok := cm.hcClients[key]; ok {
+		cm.mu.Unlock()
+		return existing, nil
+	}
 	cm.hcClients[key] = clientset
 	cm.mu.Unlock()

 	return clientset, nil
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/controller/csrapproval/client.go` around lines 56 - 79, GetHostedClusterClient uses double-checked locking and can create duplicate clients: after creating a client with createHostedClusterClient and before storing it into cm.hcClients, re-acquire the write lock (cm.mu.Lock) and re-check cm.hcClients[key]; if an entry now exists, return that existing client and dispose/close the newly created one (if it supports closing) instead of unconditionally overwriting; otherwise store the newly created client and return it. Ensure you use the same key computed in GetHostedClusterClient and the same lock fields cm.mu and map cm.hcClients in the fix.
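The re-check-under-write-lock pattern is easier to see in a stripped-down cache. In this sketch, `*int` and the `create` callback stand in for the kubernetes clientset and its constructor; the names are illustrative, not the PR's:

```go
package main

import (
	"fmt"
	"sync"
)

// cache demonstrates closing the TOCTOU window in a double-checked-locking
// client cache: build the value outside the lock, then re-check before storing.
type cache struct {
	mu      sync.RWMutex
	entries map[string]*int
}

func (c *cache) get(key string, create func() *int) *int {
	c.mu.RLock()
	if v, ok := c.entries[key]; ok {
		c.mu.RUnlock()
		return v
	}
	c.mu.RUnlock()

	// Slow path: build the value outside any lock.
	v := create()

	c.mu.Lock()
	defer c.mu.Unlock()
	// Re-check: another goroutine may have won the race while we were
	// creating; prefer its entry so all callers share one client.
	if existing, ok := c.entries[key]; ok {
		return existing
	}
	c.entries[key] = v
	return v
}

func main() {
	c := &cache{entries: map[string]*int{}}
	n := 42
	first := c.get("a", func() *int { return &n })
	m := 7
	second := c.get("a", func() *int { return &m }) // cache hit: create is not called
	fmt.Println(first == second)                    // true
}
```

If the discarded value holds resources (as a clientset's connections would), this is also the spot to close it before returning the existing entry.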
172-205: Remove unused `ctx` and `mgmtClient` parameters from `replaceServerWithInternalEndpoint`.

The function performs pure in-memory kubeconfig transformation and does not reference either parameter. Removing them simplifies the signature and clarifies that no I/O is performed.
♻️ Proposed fix
```diff
-func replaceServerWithInternalEndpoint(ctx context.Context, mgmtClient client.Client, kubeconfig *clientcmdapi.Config, hostedClusterNamespace, hostedClusterName string) error {
+func replaceServerWithInternalEndpoint(kubeconfig *clientcmdapi.Config, hostedClusterNamespace, hostedClusterName string) error {
```

Update the call site at line 107:

```diff
-	if err := replaceServerWithInternalEndpoint(ctx, cm.mgmtClient, kubeconfig, namespace, name); err != nil {
+	if err := replaceServerWithInternalEndpoint(kubeconfig, namespace, name); err != nil {
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/controller/csrapproval/client.go` around lines 172 - 205, The function replaceServerWithInternalEndpoint has unused parameters ctx and mgmtClient; remove those parameters from its signature and definition so it only accepts (kubeconfig *clientcmdapi.Config, hostedClusterNamespace, hostedClusterName string) (or reorder to match current usage), update all call sites to stop passing a context or management client, and run tests/compile to ensure no references to replaceServerWithInternalEndpoint(ctx, mgmtClient, ...) remain.

internal/controller/csrapproval/controller.go (1)
101-107: Both controllers watch the same resource type — verify no event contention.

Both `dpfhcpprovisioner` and `csrapproval` controllers use `For(&provisioningv1alpha1.DPFHCPProvisioner{})`. This is supported by controller-runtime (each controller gets its own work queue), but every status update from either controller will trigger reconciliation in both controllers. Since the CSR approval controller updates the `CSRAutoApprovalActive` condition on the same CR, this creates a feedback loop where each CSR approval status update re-triggers both controllers.

This is functionally safe (the provisioner controller will no-op if nothing changed, and the CSR controller will just re-poll), but it doubles the reconciliation rate. Worth being aware of for observability/log noise at scale.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/controller/csrapproval/controller.go` around lines 101 - 107, The two controllers both watch DPFHCPProvisioner and status-only updates cause both to reconcile; update CSRApprovalReconciler's SetupWithManager to add an event filter so status-only changes (like CSRAutoApprovalActive flips) don't trigger reconciliation for the other controller — for example replace the current ctrl.NewControllerManagedBy(mgr).For(...) chain in SetupWithManager with one that calls WithEventFilter(predicate.GenerationChangedPredicate{}) or a custom predicate (predicate.Funcs) that ignores updates where only the status/conditions changed, or that only requeues when the CSRAutoApprovalActive condition itself changes; modify the SetupWithManager function in CSRApprovalReconciler accordingly.internal/controller/csrapproval/csrapproval.go (2)
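The filtering idea behind controller-runtime's `predicate.GenerationChangedPredicate` reduces to a comparison of metadata generations, which only change on spec updates. A minimal stand-alone sketch of that decision logic (the `objMeta` type is an illustrative stand-in for Kubernetes object metadata, not the real API type):

```go
package main

import "fmt"

// objMeta stands in for the metadata carried by a Kubernetes object;
// Generation is bumped only on spec changes, never on status-only updates.
type objMeta struct{ Generation int64 }

// generationChanged mirrors the behavior of controller-runtime's
// predicate.GenerationChangedPredicate for update events: reconcile
// only when the spec generation differs between old and new objects.
func generationChanged(oldObj, newObj objMeta) bool {
	return oldObj.Generation != newObj.Generation
}

func main() {
	// Status-only update (e.g. a CSRAutoApprovalActive condition flip):
	fmt.Println(generationChanged(objMeta{Generation: 3}, objMeta{Generation: 3})) // false: skip
	// Spec change:
	fmt.Println(generationChanged(objMeta{Generation: 3}, objMeta{Generation: 4})) // true: reconcile
}
```

Note the trade-off: with this filter, the CSR controller would no longer be re-triggered by its own condition updates, but it also would not see other controllers' status changes — which is exactly the intent here.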
107-113: Consider a distinct `Reason` for the transient-error condition state.

Setting `ConditionTrue` with `ReasonCSRApprovalActive` when `processPendingCSRs` returns an error conflates a degraded state with a healthy one. Alerting/monitoring rules keyed on `CSRAutoApprovalActive=True` will silently miss this degraded path. A separate reason (e.g., `ReasonCSRApprovalDegraded`) would let operators distinguish "active and healthy" from "active but encountering errors."

♻️ Proposed change
```diff
-	if condErr := a.setCondition(ctx, dpfhcp, metav1.ConditionTrue, provisioningv1alpha1.ReasonCSRApprovalActive,
-		fmt.Sprintf("CSR processing encountered transient error: %v", err)); condErr != nil {
+	if condErr := a.setCondition(ctx, dpfhcp, metav1.ConditionTrue, provisioningv1alpha1.ReasonCSRApprovalDegraded,
+		fmt.Sprintf("CSR processing encountered transient error: %v", err)); condErr != nil {
```

(Add `ReasonCSRApprovalDegraded` to the API constants.)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/controller/csrapproval/csrapproval.go` around lines 107 - 113, The current code marks the condition as ConditionTrue with ReasonCSRApprovalActive even when processPendingCSRs returns an error; add a new reason constant (e.g., ReasonCSRApprovalDegraded) to the API constants and use it in the error branch so setCondition(ctx, dpfhcp, metav1.ConditionTrue, ReasonCSRApprovalDegraded, fmt.Sprintf(...)) is called when err != nil (leave the existing ReasonCSRApprovalActive for the successful/healthy path); update any references/tests to the reason name accordingly so operators can distinguish active+healthy vs active+degraded.
153-156: Consider filtering pending CSRs server-side to reduce per-poll data transfer.

On every 30-second poll all CSRs are fetched and filtered in-memory. For hosted clusters with many historical (approved/denied) CSRs, this is wasteful. A field selector or label selector at the API level would reduce the payload.
♻️ Example — field selector for unapproved CSRs
```diff
-	csrList, err := hcClient.CertificatesV1().CertificateSigningRequests().List(ctx, metav1.ListOptions{})
+	// Filter server-side to CSRs that have not yet been approved/denied
+	csrList, err := hcClient.CertificatesV1().CertificateSigningRequests().List(ctx, metav1.ListOptions{
+		// Note: Kubernetes does not support status-based field selectors on CSRs natively;
+		// use a label convention on bootstrap/serving CSRs if available, or keep in-memory filter
+		// but at minimum set a ResourceVersion to avoid unnecessary full re-lists.
+		ResourceVersion: "0", // return from cache
+	})
```

(The most practical improvement is to add `ResourceVersion: "0"` so the apiserver can serve from its watch cache rather than etcd on every poll.)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/controller/csrapproval/csrapproval.go` around lines 153 - 156, Currently all CSRs are fetched then filtered in-memory; change the List call to request only pending CSRs and use the watch cache by passing a ListOptions with a server-side selector and ResourceVersion "0". Update the hcClient.CertificatesV1().CertificateSigningRequests().List(ctx, metav1.ListOptions{}) invocation to pass metav1.ListOptions{FieldSelector: "<selector-for-pending-csrs>", ResourceVersion: "0"} (or a label selector if you add a label), so the API returns only unapproved/undecided CSRs and uses the apiserver cache; keep the rest of the code that processes csrList unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@internal/controller/csrapproval/csr_content_validation.go`:
- Around line 222-231: The PEM block returned by pem.Decode(csr.Spec.Request)
isn't validated for type before attempting x509.ParseCertificateRequest, so add
a guard after pemBlock is non-nil that checks pemBlock.Type == "CERTIFICATE
REQUEST" (or the exact expected PEM type) and return a clear error if it does
not match; this prevents opaque ASN.1 parse errors when calling
x509.ParseCertificateRequest on the wrong PEM type.
- Around line 261-282: validateServingIdentities currently only ensures hostname
appears somewhere in certReq.DNSNames and allows extra SANs; change it to
require that certReq.DNSNames contains exactly one entry equal to the node
hostname (i.e., check len(certReq.DNSNames) == 1 and certReq.DNSNames[0] ==
hostname) and return an error otherwise; keep the existing IP address loop using
hasValidIPAddress(certReq.IPAddresses) unchanged and reuse the same error-return
pattern for invalid IPs.
- Around line 46-75: ValidateBootstrapCSR currently doesn't assert that parsed
certificate request SAN fields are empty; after parseCSRRequest(csr) add
explicit checks that certReq.DNSNames, certReq.IPAddresses,
certReq.EmailAddresses, and certReq.URIs are all empty and return a clear error
(e.g., "bootstrap CSR must not contain SANs: <which-field>") if any are
non-empty; place these checks before validateOrganization() and validateCN() so
malformed/malicious SANs are rejected early and include the specific certReq
field name in the error to aid debugging.
- Around line 77-106: ValidateServingCSR is missing CN validation like
ValidateBootstrapCSR does; after parsing the CSR with parseCSRRequest (certReq)
add a call to validateCN using the parsed certReq and the hostname, and return a
wrapped error (e.g., "CN validation failed: %w") if it fails, ensuring the CN
must be "system:node:<hostname>" to match ValidateBootstrapCSR's behavior.
- Around line 168-214: Both validateBootstrapUsages and validateServingUsages
currently only ensure required usages are present but do not reject
extra/disallowed usages; update each function to also iterate csr.Spec.Usages
and error if any usage is not in the allowed set (for validateBootstrapUsages
allowed set = {certv1.UsageDigitalSignature, certv1.UsageClientAuth}, for
validateServingUsages allowed set = {certv1.UsageDigitalSignature,
certv1.UsageServerAuth}). Return a clear fmt.Errorf when an unexpected usage is
found. Optionally factor the logic into a shared helper like validateUsages(csr
*certv1.CertificateSigningRequest, allowed map[certv1.KeyUsage]bool) used by
both functions to remove duplication.
In `@internal/controller/csrapproval/csr_owner_validation_test.go`:
- Around line 46-197: Tests create DPUs with mgmtClient.Create but call
mgmtClient.Delete at the end of the It blocks, which can leave orphaned
resources if an assertion fails; after every mgmtClient.Create call register a
Ginkgo DeferCleanup immediately (e.g. DeferCleanup(func(){
mgmtClient.Delete(ctx, dpu) })) so cleanup always runs, update each It that
creates dpu/dpu1/dpu2/dpu3 (the tests that call NewValidator and
validator.dpuExists) to defer cleanup right after the corresponding Create, and
remove the trailing inline Delete calls.
In `@internal/controller/csrapproval/csrapproval.go`:
- Around line 309-313: Move the a.recorder.Event(dpfhcp, eventType, reason,
message) call so it happens only after a successful persist: call
a.mgmtClient.Status().Update(ctx, dpfhcp) first, check that it returned nil, and
only then emit the event; if Status().Update returns an error, return that error
without recording the event (to avoid ghost events). Ensure you keep the same
dpfhcp, eventType, reason, message values and preserve existing error
propagation behavior.
---
Duplicate comments:
In `@internal/controller/csrapproval/client.go`:
- Around line 207-214: TestConnection currently calls
clientset.Discovery().ServerVersion() which ignores the passed ctx; replace that
call with a context-aware REST request using the discovery REST client
(clientset.Discovery().RESTClient()). Perform a GET to the version endpoint
(e.g., AbsPath("/version") or the equivalent discovery path), execute the
request with Do(ctx) and handle the returned error (or unmarshal the response)
so the passed ctx deadline/cancellation is honored; update the TestConnection
function to use this RESTClient.Do(ctx) flow instead of ServerVersion().
In `@internal/controller/csrapproval/csrapproval.go`:
- Around line 293-317: The setCondition method in CSRApprover should use
meta.SetStatusCondition to atomically find/update/append the
CSRAutoApprovalActive condition on dpfhcp.Status.Conditions and only emit events
and call a.mgmtClient.Status().Update(ctx, dpfhcp) when that call returns true;
update setCondition to construct the metav1.Condition (Type
provisioningv1alpha1.CSRAutoApprovalActive, Status, Reason, Message,
ObservedGeneration dpfhcp.Generation), call changed :=
meta.SetStatusCondition(&dpfhcp.Status.Conditions, condition), and if changed
then call a.recorder.Event(dpfhcp, eventType, reason, message) (with eventType
corev1.EventTypeWarning for metav1.ConditionFalse else corev1.EventTypeNormal)
and persist via a.mgmtClient.Status().Update(ctx, dpfhcp); otherwise return nil.
- Around line 199-217: The current loops call a.processCSR(ctx, dpfhcp,
hcClient, &csr, "...") but ignore its new (bool, error) result so approved CSRs
are not being counted; declare a local approvedCount (int) before iterating
bootstrapCSRs and servingCSRs, capture the first return value (e.g., approved,
err := a.processCSR(...)), increment approvedCount when approved == true, and
after processing both sets log or expose approvedCount for observability;
reference processCSR, bootstrapCSRs, servingCSRs and the local approvedCount
variable when making the change.
- Around line 89-101: The current code correctly invalidates the cached
hosted-cluster client on TestConnection failure, but please ensure the
invalidation call occurs before setting the condition and returning; verify the
presence of a.clientManager.InvalidateClient(dpfhcp.Namespace, dpfhcp.Name) in
the TestConnection error branch and that the function then calls
a.setCondition(...) and returns ctrl.Result{}, nil (no requeue); if you find
these steps out of order or missing in csrapproval.go (the TestConnection error
handling block), move/add the InvalidateClient call so it runs before
setCondition and the early return.
---
Nitpick comments:
In `@internal/controller/csrapproval/client.go`:
- Around line 56-79: GetHostedClusterClient uses double-checked locking and can
create duplicate clients: after creating a client with createHostedClusterClient
and before storing it into cm.hcClients, re-acquire the write lock (cm.mu.Lock)
and re-check cm.hcClients[key]; if an entry now exists, return that existing
client and dispose/close the newly created one (if it supports closing) instead
of unconditionally overwriting; otherwise store the newly created client and
return it. Ensure you use the same key computed in GetHostedClusterClient and
the same lock fields cm.mu and map cm.hcClients in the fix.
- Around line 172-205: The function replaceServerWithInternalEndpoint has unused
parameters ctx and mgmtClient; remove those parameters from its signature and
definition so it only accepts (kubeconfig *clientcmdapi.Config,
hostedClusterNamespace, hostedClusterName string) (or reorder to match current
usage), update all call sites to stop passing a context or management client,
and run tests/compile to ensure no remaining references to
replaceServerWithInternalEndpoint(ctx, mgmtClient, ...) remain.
In `@internal/controller/csrapproval/controller.go`:
- Around line 101-107: The two controllers both watch DPFHCPProvisioner and
status-only updates cause both to reconcile; update CSRApprovalReconciler's
SetupWithManager to add an event filter so status-only changes (like
CSRAutoApprovalActive flips) don't trigger reconciliation for the other
controller — for example replace the current
ctrl.NewControllerManagedBy(mgr).For(...) chain in SetupWithManager with one
that calls WithEventFilter(predicate.GenerationChangedPredicate{}) or a custom
predicate (predicate.Funcs) that ignores updates where only the
status/conditions changed, or that only requeues when the CSRAutoApprovalActive
condition itself changes; modify the SetupWithManager function in
CSRApprovalReconciler accordingly.
In `@internal/controller/csrapproval/csr_content_validation_test.go`:
- Around line 180-327: The tests for ValidateServingCSR are missing a case that
asserts validateOrganization(certReq) fails when the CSR organization is not
"system:nodes"; add a new spec similar to the other Context blocks that builds a
CSR with createServingCSRWithDNS using a non-matching organization (e.g.
"wrong-org"), sets Spec.Username to "system:node:"+hostname and Groups to
include/not include as needed, calls ValidateServingCSR(csr, hostname) and
expects an error containing a message about invalid organization (or the exact
substring produced by validateOrganization); use a descriptive test name like
"When CSR has invalid organization" to locate it alongside the other Contexts
and reference ValidateServingCSR and validateOrganization in the test for
clarity.
- Around lines 27-177: Add a negative test covering step 5 (CN/hostname validation): create a CSR whose organization and other fields are valid but whose common name (CN) does not match the provided hostname, call ValidateBootstrapCSR(csr, hostname), and assert it returns an error whose message indicates the CN/hostname mismatch (e.g. contains "CN" or "common name"). Place it as a Context alongside the other negative cases and reuse helpers like createTestCSRWithOrganization and constants such as SignerBootstrap and BootstrapperUsername so it exercises validateCN(certReq, hostname).
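The CN check the suggested negative test exercises can be sketched with the standard library alone. The real validateCN in the PR is assumed to behave equivalently; the helper names and signatures here are illustrative:

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
)

// checkNodeCN rejects a certificate request whose CommonName is not
// "system:node:<hostname>" - the invariant a node bootstrap CSR must hold.
func checkNodeCN(der []byte, hostname string) error {
	req, err := x509.ParseCertificateRequest(der)
	if err != nil {
		return err
	}
	want := "system:node:" + hostname
	if req.Subject.CommonName != want {
		return fmt.Errorf("CN %q does not match expected %q", req.Subject.CommonName, want)
	}
	return nil
}

// newCSR builds a DER-encoded certificate request with the given CN.
func newCSR(cn string) []byte {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		panic(err)
	}
	der, err := x509.CreateCertificateRequest(rand.Reader, &x509.CertificateRequest{
		Subject: pkix.Name{CommonName: cn, Organization: []string{"system:nodes"}},
	}, key)
	if err != nil {
		panic(err)
	}
	return der
}

func main() {
	fmt.Println(checkNodeCN(newCSR("system:node:dpu-1"), "dpu-1") == nil)  // true
	fmt.Println(checkNodeCN(newCSR("system:node:other"), "dpu-1") == nil)  // false
}
```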
In `@internal/controller/csrapproval/csrapproval.go`:
- Around lines 107-113: The code marks the condition ConditionTrue with ReasonCSRApprovalActive even when processPendingCSRs returns an error. Add a new reason constant (e.g. ReasonCSRApprovalDegraded) to the API constants and use it in the error branch, so setCondition(ctx, dpfhcp, metav1.ConditionTrue, ReasonCSRApprovalDegraded, fmt.Sprintf(...)) is called when err != nil; keep ReasonCSRApprovalActive for the successful/healthy path. Update any references/tests to the reason names accordingly so operators can distinguish active+healthy from active+degraded.
- Around lines 153-156: All CSRs are currently fetched and then filtered in memory. Change the List call to request only pending CSRs and use the watch cache by passing a ListOptions with a server-side selector and ResourceVersion "0". Update the hcClient.CertificatesV1().CertificateSigningRequests().List(ctx, metav1.ListOptions{}) invocation to pass metav1.ListOptions{FieldSelector: "<selector-for-pending-csrs>", ResourceVersion: "0"} (or a label selector if you add a label), so the API server returns only unapproved/undecided CSRs from its cache; keep the rest of the csrList processing unchanged.
In `@internal/controller/csrapproval/hostname_test.go`:
- Around lines 296-390: The three helpers createTestCSR, createTestCSRWithOrganization, and createServingCSRWithDNS duplicate key generation, x509.CreateCertificateRequest, and PEM encoding. Consolidate them into a single helper (e.g. createCSR) that takes commonName (cn string), organizations ([]string), and dnsNames ([]string), performs rsa.GenerateKey, x509.CreateCertificateRequest, and pem.EncodeToMemory once, and returns []byte; then make the three existing helpers thin wrappers that call createCSR with the appropriate args (nil/empty slices where not needed), keeping the same panic-on-error behavior for tests.
```go
key := namespace + "/" + name
cm.mu.Lock()
delete(cm.hcClients, key)
cm.mu.Unlock()
```
Should we have a guard condition here in case the key is not in the map?
Hi Jason, `delete` is safe even when the key is absent. ref: https://go.dev/ref/spec#Deletion_of_map_elements
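As the spec reference says, `delete` on a map is a no-op when the key is absent, so no guard is needed. A minimal demonstration (map contents are illustrative stand-ins for the cached clients):

```go
package main

import "fmt"

// pruneClient removes a cached hosted-cluster client by key.
// No existence check is needed: delete on an absent key is a no-op.
func pruneClient(clients map[string]string, key string) {
	delete(clients, key)
}

func main() {
	clients := map[string]string{"ns/a": "client-a"}
	pruneClient(clients, "ns/missing") // absent key: nothing happens
	pruneClient(clients, "ns/a")       // removes the entry
	pruneClient(clients, "ns/a")       // repeated delete: also a no-op
	fmt.Println(len(clients))          // 0
}
```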
What prevents an adversary from getting their CSRs approved?
Hi @omertuc, I took inspiration from the CSR approval logic in assisted-service. You can review the validations I added in this PR: both the content validation for each CSR type and the owner validation.
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jkary, linoyaslan. The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing
- Dedicated controller: CSRApprovalReconciler polls hosted clusters every 30s
- CSR validation: hostname matching, DPU ownership verification, certificate content checks
- Client caching: ClientManager caches kubeconfig clients per hosted cluster
- Independent operation: runs alongside main controller
There's a lot that seems wrong with how various OCP / OCP related systems approve CSRs in my opinion, but we don't know their threat models and what kind of considerations / limitations they had when they chose their solutions. Regardless, even if those solutions were secure and reasonable, I don't think we can simply copy them verbatim, as maybe doing what they did in their circumstances made sense but doesn't necessarily make sense in ours. Maybe we're less limited than them and there's more we can do to ensure security. For example, in the example you linked they're limiting their CSR approval process to nodes that are in known states. Since these states are very time-limited, maybe the attack window is not big enough to be a concern. However, in this PR, if I understand correctly, we're indefinitely approving any CSR.
Anyone can write anything in these fields. This will protect against random nodes trying to join by accident, but not against someone maliciously (yet trivially) crafting CSRs they want to get approved.
Thanks @omertuc! So if I understand correctly, you think we need the following here:
I didn't suggest that; I just mentioned it as an example of a difference between the system we copied from and our system, a difference which is critical when determining whether this is a good solution. Limiting the controller to a time window would make things better, but my question still remains: what stops an adversary from getting their CSRs approved by this controller?
IDK, we need to define our goals / our threat model and what we want to protect from. We can even say something like "we don't care about CSR security because nobody except devices with physical access to the machine (like the DPU itself) can even contact the hypershift control plane" (I don't know if that's true or not - if it's not, we should probably do that as well. If it is then we don't really necessarily have to worry about CSR security because the entire hosted cluster is fully internal and inaccessible from outside)
We've experimented [1] before with injecting the private key used by kubelet to generate its CSR into the node ignition, then comparing the CSR public key against its matching private key before approving the CSR. We could theoretically do something similar here. [1] openshift/assisted-installer#895 - ignore the PR title and description, they're outdated - but the diff shows what I'm referring to. It never made it to assisted, though.
@omertuc I totally understand your concerns here, but we can start with this option. It will certainly minimize the attack surface.
@tsorya As discussed, dedicated tickets were opened for phase validation (NVIDIA-575) and for the mechanism @omertuc raised (NVIDIA-576).
Implements CSR approval for DPU worker nodes joining hosted clusters.
Implementation
- `CSRApprovalReconciler` polls hosted clusters every 30s
- `ClientManager` caches kubeconfig clients per hosted cluster

Architecture
Two controllers in the same manager pod:
- `dpfhcpprovisioner` - main provisioning logic
- `csrapproval` - CSR auto-approval polling

Example of log output